Simple Open-Vocabulary Object Detection with Vision Transformers

AI-generated keywords: Image-Text Models

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Combination of simple architectures and large-scale pre-training has revolutionized image classification
  • Pre-training and scaling approaches are not well-established in object detection, especially in long-tailed and open-vocabulary settings
  • Authors propose a robust methodology for transferring image-text models to open-vocabulary object detection
  • Standard Vision Transformer architecture with minimal modifications is used, along with contrastive image-text pre-training and end-to-end detection fine-tuning
  • Increasing image-level pre-training and model size consistently improves the downstream detection task
  • Adaptation strategies and regularizations are provided for exceptional performance in zero-shot text-conditioned and one-shot image-conditioned object detection scenarios
  • Paper presents a strong recipe for applying image-text models to open-vocabulary object detection tasks
  • Effectiveness of combining Vision Transformers with contrastive pre-training and fine-tuning techniques is showcased
  • Scaling up both pre-training data and model size leads to improved performance
  • Code and models are available on GitHub for researchers interested in exploring these methods
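
To make the contrastive image-text pre-training step named in the key points more concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of paired image and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the NumPy formulation, and the temperature of 0.07 are hypothetical choices, and the embeddings are assumed to come from image and text encoders that are not shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature value is an illustrative assumption.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick out its own caption (rows).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should pick out its own image (columns).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is low when each image embedding is closest to its own caption embedding and high when the pairing is scrambled, which is the property that later allows text embeddings to act as class queries.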

Authors: Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

ECCV 2022 camera-ready version

Abstract: Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.

Submitted to arXiv on 12 May 2022


Results of the summarizing process for the arXiv paper: 2205.06230v2

In recent years, the combination of simple architectures and large-scale pre-training has revolutionized image classification. However, when it comes to object detection, the use of pre-training and scaling approaches is not as well-established. This is particularly true in the long-tailed and open-vocabulary setting where training data is limited. To address this gap, the authors propose a robust methodology for transferring image-text models to open-vocabulary object detection. They employ a standard Vision Transformer architecture with minimal modifications and leverage contrastive image-text pre-training followed by end-to-end detection fine-tuning. The authors conduct an extensive analysis of the scaling properties of this setup and demonstrate that increasing image-level pre-training and model size consistently leads to improvements in the downstream detection task. They also provide adaptation strategies and regularizations necessary to achieve exceptional performance in zero-shot text-conditioned and one-shot image-conditioned object detection scenarios.

Overall, this paper presents a strong recipe for applying image-text models to open-vocabulary object detection tasks. The proposed methodology showcases the effectiveness of combining Vision Transformers with contrastive pre-training and fine-tuning techniques. The authors' thorough analysis highlights the benefits of scaling up both pre-training data and model size for improved performance. Researchers interested in exploring these methods can access the code and models on GitHub.
Created on 15 Jan. 2024
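
As a rough illustration of how text-conditioned (zero-shot) or image-conditioned (one-shot) detection could work in principle, the sketch below scores per-object embeddings against query embeddings by cosine similarity and keeps the objects whose best match clears a threshold. This is a sketch under stated assumptions rather than the paper's method: `score_objects`, the threshold of 0.5, and the use of plain cosine similarity are hypothetical, and the embeddings are assumed to be produced by encoders that are not shown.

```python
import numpy as np


def score_objects(object_emb, query_emb, threshold=0.5):
    """Match candidate objects to queries by cosine similarity.

    object_emb: (num_objects, dim) array, one embedding per candidate box.
    query_emb:  (num_queries, dim) array, one embedding per text prompt
                (zero-shot) or per example image (one-shot).
    Returns a list of (object_index, query_index, score) for kept objects.
    The threshold is an illustrative assumption.
    """
    obj = object_emb / np.linalg.norm(object_emb, axis=-1, keepdims=True)
    qry = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    sim = obj @ qry.T  # shape: (num_objects, num_queries)
    best_query = sim.argmax(axis=1)
    best_score = sim.max(axis=1)
    keep = np.flatnonzero(best_score >= threshold)
    return [(int(i), int(best_query[i]), float(best_score[i])) for i in keep]
```

Because the queries are just embeddings, the same scoring function would serve both settings: the query rows can come from text prompts or from an embedding of a single example image.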

