NeurIPS 2021 — Curated papers — Part 1

 UniDoc: Unified Pretraining Framework for Document Understanding

  1. Feature Extraction: Given a document image I and the locations of its document elements, OCR is used to extract the sentences and their corresponding bounding boxes.
  2. Feature Embedding: For each bounding box, visual features are extracted with a CNN backbone plus RoIAlign and then quantized using Gumbel-softmax (similar to Wav2Vec2). Sentence embeddings are extracted from a pre-trained hierarchical transformer.
  3. Gated cross-attention: This is one of the main ingredients of the work. Cross-modal interaction between the text and visual embeddings takes place through a standard cross-attention mechanism, and gating is then used to combine the representations from both modalities. (The gate is a learned parameter alpha, between 0 and 1, which determines how the embeddings are combined.)
  4. Objective function: The objective has three main parts: a) masked sentence modelling (sentences are masked rather than words, unlike BERT); b) contrastive learning over masked RoIs; c) vision-language alignment.
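
To make the gating idea in step 3 concrete, here is a minimal sketch in plain Python. It assumes the gate is a single scalar alpha obtained by applying a sigmoid to one learnable logit, and that the two embeddings are blended as a convex combination. The function names and the toy vectors are hypothetical, and the paper's exact formulation may differ.

```python
import math


def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_emb, visual_emb, gate_logit):
    """Blend two same-length embeddings with a scalar gate alpha.

    Assumption: alpha = sigmoid(gate_logit), and the fused vector is
    alpha * text + (1 - alpha) * visual.
    """
    if len(text_emb) != len(visual_emb):
        raise ValueError("embeddings must have the same dimension")
    alpha = sigmoid(gate_logit)
    return [alpha * t + (1.0 - alpha) * v for t, v in zip(text_emb, visual_emb)]


# Toy embeddings (made up for illustration only).
text_emb = [1.0, 0.0, 2.0]
visual_emb = [0.0, 4.0, 2.0]

# A logit of 0 gives alpha = 0.5, so both modalities contribute equally.
fused = gated_fusion(text_emb, visual_emb, gate_logit=0.0)
print(fused)
```

A large positive logit pushes alpha towards 1, so the text embedding dominates, while a large negative logit favours the visual embedding. In a real model the logit would be trained together with the rest of the network.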
  1. Freeze all the layers of the language model.
  2. Train a vision encoder that takes the image I. The output of its pooling layer has D * K channels and is fed as a sequence of K embeddings (each of dimension D) to the pre-trained language transformer as a prefix embedding.
  3. Since the transformer layers are frozen, the gradients from the autoregressive objective pass back through the transformer and update only the vision encoder.
  4. So the image and part of the caption are given as input, and the label is the remaining part of the caption.
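
The data flow in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the vision encoder's pooled output is modelled as a flat list of D * K numbers, and the helper names `split_into_prefix` and `make_training_example` are hypothetical. No actual model is trained here.

```python
def split_into_prefix(pooled, d, k):
    """Reshape a flat vector of length d * k into k embeddings of size d.

    These k vectors play the role of the visual prefix that is fed to
    the frozen language model ahead of the text tokens.
    """
    if len(pooled) != d * k:
        raise ValueError("expected a vector of length d * k")
    return [pooled[i * d:(i + 1) * d] for i in range(k)]


def make_training_example(caption_tokens, split_at):
    """Use the first part of the caption as input and the rest as the label."""
    return caption_tokens[:split_at], caption_tokens[split_at:]


# Toy pooled output of a vision encoder with d = 3 and k = 2 (made-up numbers).
pooled = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
prefix = split_into_prefix(pooled, d=3, k=2)

# The model sees the visual prefix plus "a dog" and should predict "on grass".
context, label = make_training_example(["a", "dog", "on", "grass"], split_at=2)
print(prefix, context, label)
```

Only the parameters that produce `pooled` would receive updates in this setup, because the language model weights stay frozen.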
  1. The third term of the equation reduces the difference between the counterfactuals of the protected and non-protected groups on the original data.
  2. The second term acts on the recourse cost of the non-protected group for the perturbed input.
    Referring to the above example, a small change in the input of the non-protected group (men's age) should not result in a different explanation, which brings fairness to the recourse cost of the protected group.
