Class-agnostic Object Detection with Multi-modal Transformer
AI-generated Key Points
- Existing methods lack a top-down supervision signal based on human-understandable semantics
- Multi-modal Vision Transformers (MViTs) trained with aligned image-text pairs effectively bridge this gap
- MViTs showcase state-of-the-art performance in localizing generic objects in images
- Existing MViTs lack multi-scale feature processing and require longer training schedules
- The paper proposes an efficient MViT architecture that uses multi-scale deformable attention and late vision-language fusion
- MViTs have applications in open-world object detection, salient and camouflage object detection, as well as supervised and self-supervised detection tasks
- MViTs offer enhanced interactability by adaptively generating proposals based on specific language queries
- The paper explains existing architectures such as GPV-I and MDETR in detail, highlighting their strengths and weaknesses
- The proposed MAVL architecture combines multi-scale image features with multi-scale deformable attention modules for improved performance
- The language structure in the image-caption pairs used for training contributes to the improved performance of MViTs
- MViTs show strong performance in generic object detection across various domains
- The resulting MViT architecture is more flexible and efficient for off-the-shelf class-agnostic object detection, and it can be customized with different text queries to generate the desired proposal sets
Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang
Abstract: What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: https://git.io/J1HPY.
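The abstract notes that MViTs generate proposals for a given language query. As a rough illustration of how such proposals might be consumed downstream, the sketch below pools proposals obtained from several text queries and removes duplicates with greedy non-maximum suppression that ignores class labels. This is not the authors' pipeline: the function names, the query names, the IoU threshold, and every box and score are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Proposal = Tuple[Box, float]             # (box, objectness score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_class_agnostic(per_query: List[List[Proposal]],
                         iou_thr: float = 0.5,
                         top_k: int = 50) -> List[Proposal]:
    """Pool proposals from all queries, then run greedy NMS without class labels."""
    # Flatten the per-query lists and sort by score, highest first.
    pooled = sorted((p for props in per_query for p in props),
                    key=lambda p: p[1], reverse=True)
    kept: List[Proposal] = []
    for box, score in pooled:
        # Keep a box only if it does not overlap heavily with a kept one.
        if all(iou(box, k[0]) < iou_thr for k in kept):
            kept.append((box, score))
        if len(kept) == top_k:
            break
    return kept


# Hypothetical model outputs for two text queries on one image (made-up values).
query_a = [((10, 10, 50, 50), 0.9), ((200, 40, 260, 120), 0.6)]
query_b = [((12, 11, 52, 49), 0.8), ((100, 100, 150, 160), 0.7)]
merged = merge_class_agnostic([query_a, query_b])
```

In this toy input the two boxes near the top-left corner overlap heavily, so only the higher-scoring one survives, while the two non-overlapping boxes are both kept. Because no class label enters the suppression step, the merged set stays class-agnostic regardless of which query produced each box.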