Simple Open-Vocabulary Object Detection with Vision Transformers

AI-generated keywords: Vision Transformer Pre-training Object Detection Scaling Properties OpenImages

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The paper proposes a recipe for transferring image-text models to open-vocabulary object detection
  • Simple architectures with large-scale pre-training have led to significant improvements in image classification, but pre-training and scaling approaches for object detection are less well established
  • The authors use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning
  • Increasing image-level pre-training and model size yield consistent improvements on the downstream detection task
  • Adaptation strategies and regularizations are provided to attain very strong performance on zero-shot text conditioned and one shot image conditioned object detection
  • The proposed approach achieves state of the art results on several benchmark datasets including COCO, LVIS, and Objects365
  • It demonstrates its effectiveness in low resource settings by achieving competitive performance on the OpenImages dataset with only 10% of its training data
  • Code and models are available on GitHub for reproducibility purposes.
Also access our AI generated: Comprehensive summary, Lay summary, Blog-like article; or ask questions about this paper to our AI assistant.

Authors: Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

Abstract: Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.

Submitted to arXiv on 12 May. 2022

Ask questions about this paper to our AI assistant

You can also chat with multiple papers at once here.

The license of the paper does not allow us to build upon its content and the AI assistant only knows about the paper metadata rather than the full article.

AI assistant instructions?

Results of the summarizing process for the arXiv paper: 2205.06230v1

This paper's license doesn't allow us to build upon its content and the summarizing process is here made with the paper's metadata rather than the article.

The paper "Simple Open-Vocabulary Object Detection with Vision Transformers" proposes a recipe for transferring image-text models to open-vocabulary object detection. The authors note that while combining simple architectures with large-scale pre-training has led to significant improvements in image classification, pre-training and scaling approaches for object detection are less well established, particularly in the long-tailed and open-vocabulary setting where training data is relatively scarce. To address this gap, the authors use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. The analysis of the scaling properties of this setup reveals that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. Furthermore, the authors provide adaptation strategies and regularizations needed to attain very strong performance on zero-shot text conditioned and one shot image conditioned object detection. The proposed approach achieves state of the art results on several benchmark datasets including COCO, LVIS, and Objects365. Additionally, it demonstrates its effectiveness in low resource settings by achieving competitive performance on the OpenImages dataset with only 10% of its training data. Overall, this paper presents a robust recipe for transferring image text models to open vocabulary object detection that achieves state of the art results while requiring minimal modifications to existing architectures. Code and models are available on GitHub for reproducibility purposes.
Created on 03 May. 2023

Assess the quality of the AI-generated content by voting

Score: 0

Why do we need votes?

Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.

The previous summary was created more than a year ago and can be re-run (if necessary) by clicking on the Run button below.

The license of this specific paper does not allow us to build upon its content and the summarizing tools will be run using the paper metadata rather than the full article. However, it still does a good job, and you can also try our tools on papers with more open licenses.

Similar papers summarized with our AI tools

Navigate through even more similar papers through a

tree representation

Look for similar papers (in beta version)

By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.