Masked Autoencoders Are Scalable Vision Learners

AI-generated keywords: Self-supervised Learning, Masked Autoencoders, Asymmetric Encoder-Decoder Architecture, ImageNet-1K Data, Transfer Performance

AI-generated Key Points

The paper presents a simple and scalable approach to self-supervised learning for computer vision using masked autoencoders (MAE).
The MAE approach involves masking random patches of the input image and reconstructing the missing pixels.
The authors develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
Masking a high proportion of the input image, such as 75%, yields a nontrivial and meaningful self-supervisory task.
Their approach enables efficient and effective training of large models, leading to improved accuracy.
A vanilla ViT-Huge model trained using their method achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data.
Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
As in NLP, simple self-supervised learning methods like MAE can provide scalable benefits in computer vision.
Compared with BEiT [2], their MAE approach is more accurate while being simpler and faster.
They demonstrate state-of-the-art performance on several benchmarks without relying on external data or supervision.

Authors: Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick

arXiv: 2111.06377v1 (cs.CV)

Tech report

License: CC BY 4.0

Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

Submitted to arXiv on 11 Nov. 2021

Results of the summarizing process for the arXiv paper: 2111.06377v1

Comprehensive Summary

This paper presents a simple and scalable approach to self-supervised learning for computer vision using masked autoencoders (MAE). The MAE approach masks random patches of the input image and reconstructs the missing pixels. The authors develop an asymmetric encoder-decoder architecture: the encoder operates only on the visible subset of patches (without mask tokens), and a lightweight decoder reconstructs the original image from the latent representation and mask tokens. They find that masking a high proportion of the input image, such as 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs makes training of large models efficient and effective: training is accelerated by 3x or more and accuracy improves. For instance, a vanilla ViT-Huge model trained using their method achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Moreover, transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

The authors note that although images and languages are signals of a different nature, simple self-supervised learning methods can provide scalable benefits in computer vision just as they do in NLP. Instead of attempting to remove objects from images, they remove random patches that most likely do not form a semantic segment. Likewise, their MAE reconstructs pixels, which are not semantic entities. Compared with BEiT [2], their MAE approach is more accurate while being simpler and faster. Additionally, they demonstrate state-of-the-art performance on several benchmarks without relying on external data or supervision. Overall, this study highlights the potential for simple self-supervised learning methods like MAE to enable efficient training of large models in computer vision without labels and with little reliance on data augmentation.

Layman's Summary

This paper talks about a new way for computers to learn how to see things better. They use something called masked autoencoders (MAE), which means they hide parts of pictures and try to guess what's missing. The program has two parts: one that looks at the visible pieces of the picture and one that tries to recreate the whole picture. When they hide a lot of the picture, it makes the task harder but also helps the computer learn better. This new way of learning helps big models train faster and work better than before. It even works better than other ways of teaching computers with pictures!

Definitions:
- Self-supervised learning: when a computer learns by itself without someone telling it what to do
- Computer vision: when a computer can "see" things in pictures or videos
- Masked autoencoders (MAE): a type of computer program that hides part of an image and tries to fill in what's missing
- Encoder-decoder architecture: the two parts of the program that look at the picture and try to recreate it
- Latent representation: a way for the computer to remember important information about what it saw

Masked Autoencoders for Self-Supervised Learning in Computer Vision

Computer vision has made tremendous progress in recent years, and self-supervised learning (SSL) is an increasingly popular approach to training models without relying on external supervision. In this paper, the authors present a simple and scalable approach to SSL for computer vision using masked autoencoders (MAE). The MAE approach involves masking random patches of the input image and reconstructing the missing pixels. This study highlights the potential of simple self-supervised learning methods like MAE to enable efficient training of large models in computer vision without labels and with little reliance on data augmentation.

Background

Self-supervised learning (SSL) is an increasingly popular technique for training machine learning models without requiring manual labeling or other forms of external supervision. It has been used extensively in natural language processing (NLP), where it has enabled impressive performance gains with minimal effort. However, its application to computer vision tasks has been limited due to the difficulty of designing meaningful self-supervisory tasks that can be applied at scale.

Methodology

The authors develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. They find that masking a high proportion of the input image, such as 75%, yields a nontrivial and meaningful self-supervisory task. The authors demonstrate that their approach enables efficient and effective training of large models, leading to improved accuracy. For instance, a vanilla ViT-Huge model trained using their method achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Moreover, transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
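The masking and token routing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the 224x224 image size and 16x16 patch size are assumed ViT-style defaults that the summary does not state, and `random_masking`, `decoder_input`, and the string stand-ins for latents and the mask token are hypothetical names chosen for illustration.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into a visible set (for the encoder) and a masked set."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = rng.sample(range(num_patches), num_patches)  # random permutation
    visible = sorted(shuffled[:num_keep])
    masked = sorted(shuffled[num_keep:])
    return visible, masked

def decoder_input(encoded_visible, visible, num_patches, mask_token="[MASK]"):
    """Rebuild the full-length sequence: latents at visible positions, mask tokens elsewhere."""
    seq = [mask_token] * num_patches
    for pos, latent in zip(visible, encoded_visible):
        seq[pos] = latent
    return seq

# Assumed setting (not stated in the summary): 224x224 image, 16x16 patches.
num_patches = (224 // 16) ** 2  # 196 patches
visible, masked = random_masking(num_patches)

# Stand-in "encoder": it only ever sees the visible patches.
# A real encoder would be a ViT producing one latent vector per patch.
encoded = [f"z{p}" for p in visible]

full = decoder_input(encoded, visible, num_patches)
print(len(visible), len(masked), len(full))  # 49 147 196
```

With these assumed sizes the encoder processes 49 of 196 patches, which is where the efficiency gain comes from; the full-length sequence is rebuilt only for the lightweight decoder.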

Comparison With Existing Methods

The authors note that although images and languages are signals of a different nature, simple self-supervised learning methods can provide scalable benefits in computer vision just as they do in NLP. Instead of attempting to remove objects from images, they remove random patches that most likely do not form a semantic segment. Likewise, their MAE reconstructs pixels, which are not semantic entities. Compared with BEiT [2], their MAE approach is more accurate while being simpler and faster. Additionally, they demonstrate state-of-the-art performance on several benchmarks without relying on external data or supervision.
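Reconstructing the missing pixels implies a pixel-space loss. The sketch below assumes a mean squared error averaged over the masked patches only, which is how the paper describes its objective, although this summary does not spell it out. The function name `masked_mse` and the toy values are hypothetical.

```python
def masked_mse(pred, target, masked):
    """Mean squared error averaged over the pixels of masked patches only.

    pred, target: lists of patches, each patch a list of pixel values.
    masked: indices of the patches that were hidden from the encoder.
    """
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count

# Toy example: 4 patches of 2 pixels each; patches 1 and 3 were masked.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pred = [[9.0, 9.0], [1.0, 2.0], [2.0, 2.0], [3.0, 5.0]]

# The large error on visible patch 0 is ignored by construction.
print(masked_mse(pred, target, masked=[1, 3]))  # 1.25
```

Restricting the loss to masked patches keeps the objective focused on predicting content the encoder never saw, instead of rewarding the model for copying visible pixels.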

Conclusion

Overall, this paper presents a novel approach for applying SSL techniques to computer vision tasks using masked autoencoders that is both simple and effective, improving model accuracy over existing approaches that rely heavily on external data or supervision. By demonstrating state-of-the-art performance across multiple benchmarks, this research highlights how powerful these techniques can be when applied correctly.

Created on 22 May. 2023
