MPViT: Multi-Path Vision Transformer for Dense Prediction

AI-generated keywords: Multi-Path Vision Transformer

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The authors address the need for effective multi-scale feature representation in dense computer vision tasks such as object detection and segmentation
  • Vision Transformers (ViTs) have emerged as a potential replacement for CNNs as the backbone in these tasks
  • The authors propose MPViT, which combines multi-scale patch embedding and a multi-path structure
  • MPViT feeds tokens of different scales independently into Transformer encoders via multiple paths, then aggregates the resulting features
  • MPViTs consistently outperform state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation
  • MPViT can serve as a versatile backbone network for a wide range of vision tasks

Authors: Youngwan Lee, Jonghee Kim, Jeff Willette, Sung Ju Hwang

technical report
License: CC BY-NC-ND 4.0

Abstract: Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representation for detecting or classifying objects or regions with varying sizes. While Convolutional Neural Networks (CNNs) have been the dominant architectures for such tasks, recently introduced Vision Transformers (ViTs) aim to replace them as a backbone. Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches. In this work, with a different perspective from existing Transformers, we explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the Transformer encoders via multiple paths and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to the diverse, multi-scale feature representations, our MPViTs scaling from tiny (5M) to base (73M) consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks. Code will be made publicly available at https://git.io/MPViT.
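
The abstract's claim that patches of different scales can yield features of the same sequence length can be checked with the standard convolution output-size formula. The sketch below is only an illustration under stated assumptions: the kernel sizes (3, 5, 7), the stride of 2, the padding rule `kernel // 2`, and the 224x224 input are hypothetical choices made here, not values taken from the paper's metadata.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-length formula for a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def sequence_lengths(height: int, width: int, kernels=(3, 5, 7), stride: int = 2):
    """Token count per path when every path pads by kernel // 2.

    Kernel sizes, stride, and padding rule are assumptions for illustration.
    Each kernel is larger than the stride, so neighbouring patches overlap.
    """
    lengths = {}
    for k in kernels:
        pad = k // 2  # "same"-style padding for odd kernels
        h = conv_output_size(height, k, stride, pad)
        w = conv_output_size(width, k, stride, pad)
        lengths[k] = h * w  # sequence length = number of spatial positions
    return lengths


# Hypothetical 224x224 input: every path yields a 112x112 grid of tokens.
lengths = sequence_lengths(224, 224)
print(lengths)  # {3: 12544, 5: 12544, 7: 12544}
```

With an odd kernel `k` and padding `k // 2`, the output size reduces to `(size - 1) // stride + 1`, which no longer depends on `k`. That is why paths with different patch scales can share one sequence length and be aggregated token by token.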

Submitted to arXiv on 21 Dec. 2021

Results of the summarizing process for the arXiv paper: 2112.11010v1

In their paper "MPViT: Multi-Path Vision Transformer for Dense Prediction," authors Youngwan Lee, Jonghee Kim, Jeff Willette, and Sung Ju Hwang address the need for effective multi-scale feature representation in dense computer vision tasks such as object detection and segmentation. Convolutional Neural Networks (CNNs) have been the dominant architecture for these tasks, but Vision Transformers (ViTs) have recently emerged as a potential replacement backbone. Like CNNs, existing ViTs build a simple multi-stage (fine-to-coarse) structure that achieves multi-scale representation with single-scale patches. The authors instead propose the Multi-Path Vision Transformer (MPViT), which combines multi-scale patch embedding with a multi-path structure. Using overlapping convolutional patch embedding, MPViT embeds patches of different scales simultaneously into features of the same size (i.e., sequence length). Tokens of each scale are then fed independently into Transformer encoders via multiple paths, and the resulting features are aggregated, enabling both fine and coarse representations at the same feature level. Thanks to these diverse multi-scale representations, MPViT models ranging from tiny (5M) to base (73M) consistently outperform state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These results indicate that MPViT can serve as a versatile backbone network for a wide range of vision tasks. The authors plan to make their code publicly available at https://git.io/MPViT.
Created on 06 Sep. 2024
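
The multi-path idea in the summary (tokens of each scale pass through their own encoder, then the outputs are aggregated) can be sketched with plain Python lists. This is a toy illustration only: `scale_by` is a hypothetical stand-in for a Transformer encoder, and aggregation by per-token feature concatenation is an assumption made here, since the metadata says only that the features are "aggregated".

```python
from typing import Callable, List

Token = List[float]      # one token = one feature vector
TokenSeq = List[Token]   # one path's input = a list of tokens
Encoder = Callable[[TokenSeq], TokenSeq]


def scale_by(factor: float) -> Encoder:
    """Hypothetical placeholder encoder: scales every feature by a constant."""
    return lambda tokens: [[factor * x for x in tok] for tok in tokens]


def run_paths(paths_in: List[TokenSeq], encoders: List[Encoder]) -> List[TokenSeq]:
    """Feed each scale's tokens through its own encoder, independently."""
    if len(paths_in) != len(encoders):
        raise ValueError("need exactly one encoder per path")
    return [enc(tokens) for enc, tokens in zip(encoders, paths_in)]


def aggregate(paths_out: List[TokenSeq]) -> TokenSeq:
    """Concatenate the features of matching tokens across paths (assumed rule)."""
    length = len(paths_out[0])
    if any(len(p) != length for p in paths_out):
        raise ValueError("all paths must share the same sequence length")
    return [[x for p in paths_out for x in p[i]] for i in range(length)]


# Two paths with the same sequence length (2 tokens) but different content.
fine = [[1.0, 2.0], [3.0, 4.0]]
coarse = [[10.0, 20.0], [30.0, 40.0]]
merged = aggregate(run_paths([fine, coarse], [scale_by(1.0), scale_by(0.5)]))
print(merged)  # [[1.0, 2.0, 5.0, 10.0], [3.0, 4.0, 15.0, 20.0]]
```

Token-wise aggregation works only because every path emits the same number of tokens, which is why the summary stresses embedding all patch scales to the same sequence length.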

