BEIT: BERT Pre-Training of Image Transformers

We are all aware of how successful BERT has been for NLP applications. BERT is built on the Transformer architecture, and recently Transformers have also achieved significant success in vision and audio. Over the last year we have seen a lot of work in the vision domain using Transformers (DINO, An Image is Worth 16x16 Words, etc.). One of the key ideas in BEiT is to treat image tokens analogously to text tokens in NLP.

  1. The image is divided into a grid of patches (tokens).
  2. Blocks of patches are masked randomly.
  3. Each image patch is flattened into a vector.
  4. Patch embeddings and positional embeddings are learned for the patches.
  5. These embeddings are passed through a BERT-like Transformer encoder.
  6. For the masked positions, the model has to predict the corresponding visual tokens.
  7. These visual tokens come from an image tokenizer (the discrete VAE used by DALL-E).
  8. Finally, the image can be reconstructed from the tokens (a minimal code sketch of this pipeline follows the list).
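As a rough illustration, here is a minimal PyTorch-style sketch of the masked image modeling pipeline described above. All class and parameter names (BeitPretrainSketch, patch_embed, mask_token, etc.) are my own illustrative choices, not the official BEiT code, and the image tokenizer that provides the target visual tokens (the DALL-E discrete VAE in the paper) is stubbed out with random ids.

```python
# Minimal sketch of BEiT-style masked image modeling (not the official code).
import torch
import torch.nn as nn

class BeitPretrainSketch(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, vocab_size=8192):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        # Steps 3-4: flatten each patch, learn patch + positional embeddings
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Step 5: BERT-like Transformer encoder
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Steps 6-7: predict a visual-token id (tokenizer vocabulary) per patch
        self.to_token_logits = nn.Linear(dim, vocab_size)

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim) flattened image patches
        # mask:    (B, N) boolean, True where the patch is masked (step 2)
        x = self.patch_embed(patches)
        mask_tokens = self.mask_token.expand(x.size(0), x.size(1), -1)
        x = torch.where(mask.unsqueeze(-1), mask_tokens, x)
        x = x + self.pos_embed
        x = self.encoder(x)
        return self.to_token_logits(x)  # (B, N, vocab_size)

# Usage: targets would come from a frozen image tokenizer (dVAE); stubbed here.
model = BeitPretrainSketch()
patches = torch.randn(2, 196, 3 * 16 * 16)
mask = torch.rand(2, 196) < 0.4                    # mask a subset of patches
target_ids = torch.randint(0, 8192, (2, 196))      # stand-in for tokenizer output
logits = model(patches, mask)
loss = nn.functional.cross_entropy(logits[mask], target_ids[mask])
```

In the actual paper the masking is blockwise rather than fully random, a [CLS] token is prepended, and roughly 40% of the patches are masked; the sketch keeps only the core idea of predicting visual-token ids at the masked positions with a cross-entropy loss.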
