You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

AI-generated keywords: Transformers, Object Detection, YOLOS, ImageNet-1k, COCO

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • The authors explore whether Transformers can perform 2D object- and region-level recognition with minimal knowledge about the 2D spatial structure
  • They propose a series of object detection models called YOLOS, based on the vanilla Vision Transformer with the fewest possible modifications, region priors, and inductive biases of the target task
  • YOLOS pre-trained only on the mid-sized ImageNet-1k dataset achieves competitive results on the COCO object detection benchmark
  • YOLOS-Base, directly adopted from the BERT-Base architecture, achieves a box Average Precision (AP) of 42.0 on COCO val
  • The paper discusses the impacts and limitations of current pre-training schemes and model scaling strategies for Transformers in vision tasks
  • It offers insights into how these factors affect performance and suggests directions for future improvement
  • The results indicate that Transformers can handle 2D object- and region-level recognition tasks with minimal knowledge about spatial structure
  • The findings highlight the potential of Transformers in computer vision applications
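
For readers unfamiliar with the box AP metric quoted above, the sketch below shows the intersection-over-union (IoU) overlap measure that box AP builds on. It is an illustration only, not the paper's evaluation code: the function name `box_iou`, the `(x_min, y_min, x_max, y_max)` box format, and the constant name are assumptions made here. The threshold list reflects the standard COCO convention of averaging AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption
    of the sketch, not something specified by the paper metadata.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95 (ten values).
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Count how many COCO IoU thresholds a single predicted box satisfies."""
    return sum(1 for t in COCO_IOU_THRESHOLDS if iou >= t)
```

A prediction counts as a match at a given threshold only if its IoU with a ground-truth box reaches that threshold, so a loosely localized box contributes at the lenient thresholds but not at the strict ones.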

Authors: Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu

NeurIPS 2021 Camera Ready

Abstract: Can Transformer perform 2D object- and region-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, as well as inductive biases of the target task. We find that YOLOS pre-trained on the mid-sized ImageNet-1k dataset only can already achieve quite competitive performance on the challenging COCO object detection benchmark, e.g., YOLOS-Base directly adopted from BERT-Base architecture can obtain 42.0 box AP on COCO val. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through YOLOS. Code and pre-trained models are available at https://github.com/hustvl/YOLOS.
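
To make the "pure sequence-to-sequence perspective" in the abstract more concrete, here is a minimal sketch of how an image can be turned into a flat token sequence for a Vision Transformer style detector. It is not taken from the YOLOS code base. The function names, the nested-list image format, and the zero-valued placeholders are assumptions made for illustration. In the real model the patches would be linearly projected and the appended detection tokens would be learnable embeddings; both steps are omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches.

    Patches are returned in row-major order; after this step the model
    sees only a 1D sequence, not the 2D grid.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    patch.extend(image[y][x])  # append the C channel values
            sequence.append(patch)
    return sequence


def build_input_sequence(image, patch_size, num_det_tokens):
    """Append placeholder detection tokens to the patch sequence.

    The zero vectors stand in for learnable detection-token embeddings
    (an assumption of this sketch; no projection layer is modeled).
    """
    patches = patchify(image, patch_size)
    dim = len(patches[0])
    det_tokens = [[0.0] * dim for _ in range(num_det_tokens)]
    return patches + det_tokens


# Usage: a 32 x 32 RGB image with 16 x 16 patches gives 4 patch tokens.
image = [[[0.0, 0.0, 0.0] for _ in range(32)] for _ in range(32)]
tokens = build_input_sequence(image, patch_size=16, num_det_tokens=100)
print(len(tokens), len(tokens[0]))
```

The sequence length is the number of patches plus the number of detection tokens, and each token here has 16 × 16 × 3 values because the projection step is skipped.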

Submitted to arXiv on 01 Jun. 2021

Results of the summarizing process for the arXiv paper: 2106.00666v3

In the paper "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection," authors Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu and Wenyu Liu explore whether Transformers can perform 2D object- and region-level recognition with minimal knowledge about the 2D spatial structure. They propose a series of object detection models called You Only Look at One Sequence (YOLOS), which are based on the vanilla Vision Transformer with the fewest possible modifications, region priors, and inductive biases of the target task.

The authors pre-train YOLOS only on the mid-sized ImageNet-1k dataset and evaluate it on the challenging COCO object detection benchmark. Even with this limited pre-training and without substantial architectural changes, YOLOS achieves competitive results. For instance, YOLOS-Base, directly adopted from the BERT-Base architecture, achieves a box Average Precision (AP) of 42.0 on COCO val.

The paper also discusses the impacts and limitations of current pre-training schemes and model scaling strategies for Transformers in vision tasks through the lens of YOLOS. The authors provide insights into how these factors affect performance and suggest future directions for improvement.

Overall, this study indicates that Transformers can handle 2D object- and region-level recognition tasks with minimal knowledge about spatial structure. The findings highlight the potential of Transformers in computer vision applications. Code and pre-trained models are available at https://github.com/hustvl/YOLOS.
Created on 05 Sep. 2023
