NeurIPS 2021 - Curated papers - Part 2
Link to my deep learning blogs : https://rakshithv-deeplearning.blogspot.com/
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
This work addresses Vision-Language Pre-training (VLP), i.e., learning multi-modal representations through pre-training.
1. Given an image-text pair, the image is passed through a Vision Transformer (ViT) and the ViT outputs are taken as visual tokens. For text, tokens are generated with the BERT tokenizer.
2. [CLS]_[Visual tokens]_[SEP]_[Text tokens] are concatenated, and masks are generated in the same way as in BERT.
3. The concatenated image and text tokens are fed to the Multi-modal Transformer (MT); a sketch of this input assembly follows the list.
4. Pre-training has 3 objective functions:
a. MLM (Masked Language Modelling): As in BERT, masked tokens in the text are predicted.
b. ITM (Image-Text Matching): With probability 0.5 the image is replaced by a different image, and a binary classification (matching or non-matching pair) is used to learn inter-modal alignment.
c. MFR (Masked Feature Regression): It is based on the premise that visual tokens with high mutual attention weights share similar semantic properties. So they pick one random visual token to mask, additionally mask the tokens that have the top-k attention weights to it, and perform L2 regression on the masked features.
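Below is a minimal sketch (in PyTorch) of how the input sequence to the multi-modal transformer could be assembled; the toy encoder stand-ins, dimensions, and variable names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions; the real model uses a pre-trained ViT and the BERT tokenizer/embeddings.
D = 768                          # shared hidden size (assumption)
num_patches, text_len = 196, 20

# Stand-ins for the actual encoders (assumptions for illustration).
vit = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True), num_layers=2)
text_embed = nn.Embedding(30522, D)            # BERT vocabulary size
cls_emb = nn.Parameter(torch.randn(1, 1, D))   # [CLS]
sep_emb = nn.Parameter(torch.randn(1, 1, D))   # [SEP]

image_patches = torch.randn(1, num_patches, D)      # patch embeddings of one image
text_ids = torch.randint(0, 30522, (1, text_len))   # dummy BERT tokenizer output

visual_tokens = vit(image_patches)   # step 1: ViT outputs used as visual tokens
text_tokens = text_embed(text_ids)   # text token embeddings

# Steps 2-3: [CLS] + visual tokens + [SEP] + text tokens -> multi-modal transformer.
mm_input = torch.cat([cls_emb, visual_tokens, sep_emb, text_tokens], dim=1)
multimodal_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True), num_layers=2)
mm_output = multimodal_transformer(mm_input)   # fused features for the MLM/ITM/MFR heads
print(mm_output.shape)   # torch.Size([1, 1 + 196 + 1 + 20, 768])
```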
Another main contribution is a metric to quantify the inter-modality information flow:
A(i,j) = sum of attention weights between image and text tokens in a layer
A(i,i) = sum of attention weights within a single modality (image-to-image or text-to-text)
IMF (Inter-Modality Flow) = A(i,j) / [A(i,j) + A(i,i)]
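As a small illustration, the IMF of one layer could be computed from its (head-averaged) attention matrix roughly as follows, assuming tokens are ordered [visual | text]; the shapes and names are illustrative assumptions.

```python
import torch

num_visual, num_text = 196, 20
n = num_visual + num_text
# Attention weights of one layer, averaged over heads: rows = queries, cols = keys.
attn = torch.rand(n, n)
attn = attn / attn.sum(dim=-1, keepdim=True)   # row-normalize, like a softmax output

is_visual = torch.zeros(n, dtype=torch.bool)
is_visual[:num_visual] = True

# A(i,j): attention mass flowing between the two modalities (image <-> text).
cross = attn[is_visual][:, ~is_visual].sum() + attn[~is_visual][:, is_visual].sum()
# A(i,i): attention mass staying within the same modality.
intra = attn[is_visual][:, is_visual].sum() + attn[~is_visual][:, ~is_visual].sum()

imf = cross / (cross + intra)   # Inter-Modality Flow
print(float(imf))
```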
The take-aways are the generation of visual tokens with a ViT and the many effective ways the authors use attention weights at different parts of the network.
Link to the paper : https://arxiv.org/pdf/2106.13488.pdf
Attention Bottlenecks for Multimodal Fusion
Transformers have become a go-to recipe for tasks across modalities such as vision, audio and text. The proposed work studies different ways of fusing multi-modal inputs.
1. Fusion via Vanilla Self-Attention
In this method, tokens and embeddings are generated for video frames following the Vision Transformer (ViT) recipe; call these Zf. Similarly, tokens and embeddings are generated for audio spectrograms; call these Za. Both are concatenated as [Zf, Za] and fed to the transformer, so self-attention is performed jointly over the frame and audio embeddings.
2. Fusion with Modality-specific Parameters
In this method, each modality (audio and frames) has its own parameters, and interaction between the modalities is achieved through cross-attention layers instead of shared self-attention layers.
3. Fusion via Attention Bottlenecks
In order to avoid the quadratic attention cost of the transformer over the full joint sequence, special tokens called bottleneck fusion tokens are added to the input, so Z = [Zf, Zbf, Za]. Attention is then performed separately between Zf and Zbf (the bottleneck fusion tokens) and between Za and Zbf, so all interaction between modalities passes through the bottleneck tokens; a sketch follows this list.
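A rough sketch of a single bottleneck-fusion layer, assuming a shared transformer layer and a handful of bottleneck tokens (the averaging update and all sizes/names are illustrative simplifications, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

D, num_frame_tok, num_audio_tok, num_bottleneck = 256, 50, 40, 4
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)

z_f = torch.randn(1, num_frame_tok, D)     # video-frame tokens (Zf)
z_a = torch.randn(1, num_audio_tok, D)     # audio tokens (Za)
z_bf = torch.randn(1, num_bottleneck, D)   # bottleneck fusion tokens (Zbf)

# Each modality attends only to itself plus the bottleneck tokens.
out_f = layer(torch.cat([z_f, z_bf], dim=1))
out_a = layer(torch.cat([z_a, z_bf], dim=1))

z_f = out_f[:, :num_frame_tok]
z_a = out_a[:, :num_audio_tok]
# Bottleneck tokens are updated from both passes (averaged here as a simple choice).
z_bf = 0.5 * (out_f[:, num_frame_tok:] + out_a[:, num_audio_tok:])
print(z_f.shape, z_a.shape, z_bf.shape)
```

Because each pass attends over only one modality plus a few bottleneck tokens, the attention cost stays close to the uni-modal case while still letting information flow across modalities.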
In the overall framework, the authors use conventional self-attention in the early layers to focus on uni-modal learning; in the later layers, any of the above 3 fusion strategies can be used for the multi-modal connection.
A detailed ablation study of the fusion strategies and of the layer at which to fuse is given in the paper.
The main take-away from this work is the set of different ways of connecting/fusing multi-modal data.
Link to the paper : https://arxiv.org/pdf/2107.00135.pdf
AugMax: Adversarial Composition of Random Augmentations for Robust Training
The goal of augmentation is to increase the diversity of the training data so that the model's generalization improves. The authors describe two main categories of augmentation:
- Diversity -> This includes operations as simple as random crop, translation, etc. The AugMix method achieves better diversity by stochastically picking and mixing different augmentation methods.
- Hardness -> Generating an adversarial image from the original image.
The presented work is essentially a unification of both, as described below:
The image is passed through 3 augmentation chains in parallel, and their outputs are combined with a learnable parameter "w". Then the original image and the mixed augmented image are combined with a learnable parameter "m". Parameter "m" controls how much feature-level similarity is kept between the original and the augmented image.
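A minimal sketch of this composition, treating w and m as the learnable mixing parameters; the three augmentation chains and the sigmoid on m are placeholder assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def augmax_compose(x, m, w, chains):
    """Mix the outputs of the augmentation chains with softmax(w), then mix the
    result with the original image using a mixing scalar m kept in [0, 1]."""
    weights = F.softmax(w, dim=0)                  # chain weights lie in [0, 1] and sum to 1
    x_mix = sum(wi * chain(x) for wi, chain in zip(weights, chains))
    m = torch.sigmoid(m)                           # squash m into [0, 1] (an assumption here)
    return (1 - m) * x + m * x_mix

# Placeholder chains standing in for the random augmentation sequences.
chains = [
    lambda x: x.flip(-1),                       # horizontal flip
    lambda x: torch.roll(x, 4, dims=-1),        # small translation
    lambda x: x + 0.1 * torch.randn_like(x),    # additive noise
]

x = torch.rand(1, 3, 32, 32)               # original image
w = torch.zeros(3, requires_grad=True)     # learnable chain-mixing weights
m = torch.zeros(1, requires_grad=True)     # learnable image-mixing scalar
x_aug = augmax_compose(x, m, w, chains)
print(x_aug.shape)                         # torch.Size([1, 3, 32, 32])
```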
The normal training setup is to minimize the loss L by optimizing the parameters θ:
θ* = arg min_θ L(f(x; θ))
Here, if we instead train with an augmented image x*, where
x* = g(x_orig; m, w), with g being the AugMax operator, then
θ* = arg min_θ max_{m, w} L(f(g(x_orig; m, w); θ))
Hence this is a min-max problem: m and w try to augment an image so that it maximizes the model's loss, while θ is trained to minimize it. A softmax is applied to w so that the chain-mixing weights lie between 0 and 1 and sum to 1.
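A hedged sketch of one such min-max training step, with a compact stand-in for the AugMax operator g, a toy classifier, and a single sign-gradient ascent step on (m, w); the actual AugMax attack schedule and full training recipe differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder chains and a compact AugMax-style operator g(x; m, w):
# 3 chains mixed with softmax(w), then mixed with the original image via sigmoid(m).
chains = [lambda x: x.flip(-1),
          lambda x: torch.roll(x, 4, dims=-1),
          lambda x: x + 0.1 * torch.randn_like(x)]

def g(x, m, w):
    x_mix = sum(wi * c(x) for wi, c in zip(F.softmax(w, dim=0), chains))
    return (1 - torch.sigmoid(m)) * x + torch.sigmoid(m) * x_mix

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier f
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))

# Inner maximization: one gradient-ascent step on (m, w) to make the augmentation harder.
w = torch.zeros(3, requires_grad=True)
m = torch.zeros(1, requires_grad=True)
adv_loss = F.cross_entropy(model(g(x, m, w)), y)
grad_w, grad_m = torch.autograd.grad(adv_loss, [w, m])
with torch.no_grad():
    w += grad_w.sign()   # unit sign step; the real attack step size/schedule differs
    m += grad_m.sign()

# Outer minimization: update θ on the hardened augmented batch.
opt.zero_grad()
loss = F.cross_entropy(model(g(x, m, w).detach()), y)
loss.backward()
opt.step()
print(float(loss))
```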
In short, this is an augmentation framework that unifies diversity and hardness.
Link to the paper : https://arxiv.org/pdf/2110.13771.pdf
Revisiting Model Stitching to Compare Neural Representations
In this work, the authors focus on proving the following point, as they state in the paper: "we use model stitching to obtain quantitative verifications for intuitive statements such as 'good networks learn similar representations' and 'more is better'".
The underlying assumption is that good models learn similar internal representations even when they are trained with different initializations, architectures and objectives. This is shown by stitching together various models trained on the same data distribution with identical architectures and different random seeds, with minimal stitching loss.
In this process, say we have neural network A and neural network B with identical architectures; the earlier layers of A can be stitched to the later layers of B, and the result is called a stitched model. A stitching layer, such as a 1x1 convolution, is learned and used to align the representations of the different models (the stitching process need not be limited to two models). Better performance was seen when a model trained with more data or better hyperparameters in a self-supervised way was stitched together with a model trained in a supervised way; a toy sketch follows.
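A toy sketch of the stitching idea, assuming two small CNNs split at a chosen layer and a learnable 1x1 convolution as the stitching layer (the architectures, split point, and frozen-ends setup are illustrative assumptions):

```python
import torch
import torch.nn as nn

def make_net():
    # Toy CNN: "front" produces a feature map, "back" classifies it.
    front = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
    back = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
    return front, back

front_a, _ = make_net()   # network A (e.g., trained with one seed / objective)
_, back_b = make_net()    # network B (e.g., trained with another seed / objective)

# Stitching layer: a learnable 1x1 conv aligning A's features to B's expected input.
stitch = nn.Conv2d(32, 32, kernel_size=1)
stitched_model = nn.Sequential(front_a, stitch, back_b)

# Only the stitching layer is trained; the halves of A and B stay frozen.
for p in front_a.parameters():
    p.requires_grad_(False)
for p in back_b.parameters():
    p.requires_grad_(False)

x = torch.rand(4, 3, 32, 32)
logits = stitched_model(x)
print(logits.shape)   # torch.Size([4, 10])
```

Training only the 1x1 stitching layer and measuring how much accuracy the stitched model loses relative to the originals is what gives the quantitative comparison of representations described above.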
Link to the paper : https://arxiv.org/abs/2106.07682