Class-agnostic Object Detection with Multi-modal Transformer

AI-generated keywords: Computer Vision

AI-generated Key Points

  • Existing methods lack a top-down supervision signal based on human-understandable semantics
  • Multi-modal Vision Transformers (MViTs) trained with aligned image-text pairs effectively bridge this gap
  • MViTs showcase state-of-the-art performance in localizing generic objects in images
  • Existing MViTs lack multi-scale feature processing and require longer training schedules
  • Proposed efficient MViT architecture using multi-scale deformable attention and late vision-language fusion
  • MViTs have applications in open-world object detection, salient and camouflage object detection, as well as supervised and self-supervised detection tasks
  • MViTs offer enhanced interactability by adaptively generating proposals based on specific language queries
  • Detailed explanations of existing architectures such as GPV-I and MDETR, highlighting their strengths and weaknesses
  • Proposed MAVL architecture combines multi-scale image features with multi-scale deformable attention modules for improved performance
  • Language structure available in image-caption pairs used for training contributes to the improved performance of MViTs
  • Impressive performance of MViTs in generic object detection across various domains showcased
  • More flexible and efficient MViT architecture developed for off-the-shelf class-agnostic object detection, customizable with different text queries to generate desired proposal sets

Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang

Accepted at ECCV 2022
License: CC ZERO 1.0

Abstract: What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: https://git.io/J1HPY

Submitted to arXiv on 22 Nov. 2021

Results of the summarizing process for the arXiv paper: 2111.11430v6

In the field of computer vision, the question of what constitutes an object has long been a topic of debate. Various approaches have been developed to score objectness, but they often struggle to scale well across different domains and novel objects. This paper argues that existing methods lack a top-down supervision signal based on human-understandable semantics. For the first time, the authors demonstrate that Multi-modal Vision Transformers (MViTs) trained with aligned image-text pairs can effectively bridge this gap.

The authors conduct extensive experiments across various domains and novel objects to showcase the state-of-the-art performance of MViTs in localizing generic objects in images. They also identify that existing MViTs lack multi-scale feature processing and require longer training schedules. To address these limitations, the authors propose an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. The significance of MViT proposals is demonstrated in a diverse range of applications, including open-world object detection, salient and camouflage object detection, as well as supervised and self-supervised detection tasks. Additionally, MViTs offer enhanced interactability by adaptively generating proposals based on specific language queries.

The paper provides detailed explanations of existing architectures such as GPV-I and MDETR, highlighting their strengths and weaknesses. The proposed MAVL architecture combines multi-scale image features with multi-scale deformable attention modules for improved performance. Through systematic experiments, the authors analyze the factors contributing to the improved performance of MViTs. They emphasize the role of language structure available in image-caption pairs used for training.

In conclusion, this paper showcases the impressive performance of MViTs in generic object detection across various domains. The authors develop a more flexible and efficient MViT architecture for off-the-shelf class-agnostic object detection, which can be customized with different text queries to generate desired proposal sets. The use-cases for class-agnostic proposals are explored extensively, demonstrating their potential in improving performance in different scenarios.
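
To make the idea of query-driven proposal sets concrete, the following is a minimal sketch of one way proposals obtained from several text queries could be pooled into a single class-agnostic set. It is not the authors' code: the query strings, boxes, scores, the function name `merge_query_proposals`, and the choice of greedy non-maximum suppression with an IoU threshold of 0.5 are all illustrative assumptions, and the MViT that would produce the per-query proposals is not modeled here.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Proposal = Tuple[Box, float]             # (box, objectness score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_query_proposals(
    per_query: Dict[str, List[Proposal]],
    iou_thresh: float = 0.5,   # assumed threshold, not taken from the paper
    top_k: int = 50,           # assumed proposal budget
) -> List[Proposal]:
    """Pool proposals from all queries, then apply greedy class-agnostic NMS."""
    pooled = sorted(
        (p for props in per_query.values() for p in props),
        key=lambda p: p[1],
        reverse=True,
    )
    kept: List[Proposal] = []
    for box, score in pooled:
        # Keep a box only if it does not heavily overlap a higher-scoring one.
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
        if len(kept) == top_k:
            break
    return kept


# Hypothetical per-query outputs; in practice these would come from the model.
per_query = {
    "all objects": [((10, 10, 50, 50), 0.9), ((100, 100, 150, 160), 0.6)],
    "all entities": [((12, 11, 51, 49), 0.8), ((200, 20, 240, 60), 0.7)],
}
merged = merge_query_proposals(per_query)
for box, score in merged:
    print(box, score)
```

In this toy input the two queries return near-duplicate boxes for the same region, so only the higher-scoring one survives, while boxes found by a single query are retained. Changing the set of queries changes which proposals enter the pool, which is the sense in which the proposal set is customizable.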
Created on 15 Jan. 2024

