Masked Autoencoders Are Scalable Vision Learners
AI-generated Key Points
- The paper presents a simple and scalable approach to self-supervised learning for computer vision using masked autoencoders (MAE).
- The MAE approach involves masking random patches of the input image and reconstructing the missing pixels.
- The authors develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
- Masking a high proportion of the input image, such as 75%, yields a nontrivial and meaningful self-supervisory task.
- Their approach enables efficient and effective training of large models, leading to improved accuracy.
- A vanilla ViT-Huge model trained using their method achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data.
- Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
- Simple self-supervised learning methods like MAE can scale well in computer vision, as comparable methods already do in NLP.
- Compared with BEiT [2], the MAE approach achieves comparable results while being simpler and more computationally efficient.
- They demonstrate state-of-the-art performance on several benchmarks without relying on external data or supervision.
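The masking and asymmetric encoder-decoder flow described in the key points can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`random_masking`, `encoder_input`, `decoder_input`), the shuffle-based sampling, and the string placeholder for the mask token are all assumptions made for the example.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into a visible set and a masked set."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)                      # random permutation of patch indices
    visible = sorted(order[:num_keep])      # patches the encoder will see
    masked = sorted(order[num_keep:])       # patches to be reconstructed
    return visible, masked

def encoder_input(patches, visible):
    """The encoder receives only the visible patches (no mask tokens)."""
    return [patches[i] for i in visible]

def decoder_input(latents, visible, num_patches, mask_token):
    """The decoder receives the full sequence: encoded visible patches
    placed back at their original positions, mask tokens everywhere else."""
    full = [mask_token] * num_patches
    for latent, idx in zip(latents, visible):
        full[idx] = latent
    return full

# Hypothetical usage: 196 patches stand in for a 14x14 grid of image patches.
patches = [f"patch_{i}" for i in range(196)]
visible, masked = random_masking(len(patches), mask_ratio=0.75, seed=0)
latents = encoder_input(patches, visible)   # a real encoder would transform these
sequence = decoder_input(latents, visible, len(patches), "[MASK]")
```

In this sketch the encoder is an identity function, so `latents` are just the visible patches. In a real model the encoder would be a Vision Transformer and the mask token a learned vector.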
Authors: Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
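A back-of-the-envelope calculation suggests why skipping masked patches in the encoder saves compute. The sketch below assumes a 224x224 input split into 16x16 patches (a common ViT setup, not stated in the abstract) and the usual Transformer cost model in which MLP cost grows linearly and self-attention cost grows quadratically with sequence length. The function name and the returned keys are hypothetical, and the fractions cover the encoder only, so they are not the paper's measured end-to-end speedup.

```python
def encoder_token_stats(image_size=224, patch_size=16, mask_ratio=0.75):
    """Estimate how many patches the encoder processes and the relative
    per-layer cost compared with encoding every patch."""
    patches_per_side = image_size // patch_size
    total = patches_per_side ** 2
    visible = int(total * (1 - mask_ratio))
    keep = visible / total
    return {
        "total_patches": total,
        "visible_patches": visible,
        "linear_cost_fraction": keep,          # e.g. MLP blocks
        "attention_cost_fraction": keep ** 2,  # pairwise token interactions
    }

stats = encoder_token_stats()
print(stats)
```

Under these assumptions the encoder sees a quarter of the patches. The lightweight decoder still processes the full sequence, so the overall saving is smaller than the encoder-only fractions suggest.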