Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

AI-generated keywords: State space models, Vision Mamba, Efficient hardware-aware designs, Bidirectional Mamba blocks, Visual representation learning

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • State space models (SSMs) with efficient hardware-aware designs, like Mamba, have shown promise in long sequence modeling
  • Vision Mamba (Vim) is a new approach that builds efficient and generic vision backbones using SSMs
  • Vim replaces self-attention with bidirectional Mamba blocks, showing that visual representation learning does not need to rely on self-attention
  • Vim marks image sequences with position embeddings and compresses visual representations using bidirectional state space models
  • Vim achieves higher performance compared to well-established vision transformers like DeiT on tasks such as ImageNet classification, COCO object detection, and ADE20k semantic segmentation
  • Vim is markedly more efficient in computation and memory: it is 2.8 times faster than DeiT and saves 86.8% of GPU memory when performing batch inference to extract features on 1248×1248 images
  • Vim overcomes computation and memory constraints while performing Transformer-style understanding for high-resolution images
  • Code for implementing Vim can be found at https://github.com/hustvl/Vim
  • Vim offers a promising solution for efficient visual representation learning in computer vision tasks and has the potential to become the next-generation backbone for vision foundation models
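
The idea in the key points above (add position embeddings to a sequence of image patches, then scan it in both directions with a state space model) can be illustrated with a toy sketch. This is not the authors' implementation: Mamba uses input-dependent (selective) parameters and a hardware-aware scan, whereas the code below assumes a fixed diagonal linear recurrence, and the helper names `patchify`, `ssm_scan`, and `bidirectional_block` are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def ssm_scan(u, a, b, c):
    """Toy diagonal SSM: h_t = a * h_{t-1} + b * u_t, y_t = c * h_t.

    Assumption: a, b, c are fixed per-channel vectors; real Mamba
    computes them from the input at every step.
    """
    h = np.zeros_like(u[0])
    out = np.empty_like(u)
    for t in range(len(u)):
        h = a * h + b * u[t]
        out[t] = c * h
    return out

def bidirectional_block(tokens, pos_embed, a, b, c):
    """Mark tokens with positions, scan forward and backward, then sum."""
    x = tokens + pos_embed
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]  # scan the reversed sequence, then undo the flip
    return fwd + bwd

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
tokens = patchify(image, patch=16)                    # 2 x 2 patches of 16*16*3 values
pos_embed = rng.standard_normal(tokens.shape) * 0.02  # stand-in for learned embeddings
d = tokens.shape[1]
out = bidirectional_block(tokens, pos_embed,
                          a=np.full(d, 0.9), b=np.ones(d), c=np.ones(d))
print(out.shape)  # (4, 768)
```

Summing the two scans is one simple way to give every output position access to context from both sides of the sequence; a single forward scan would only see the patches that come before it.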

Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang

Work in progress. Code is available at https://github.com/hustvl/Vim

Abstract: Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have shown great potential for long sequence modeling. Building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance of visual representation learning on self-attention is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to become the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.
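
To make the high-resolution constraint mentioned in the abstract concrete, the following back-of-the-envelope sketch counts the tokens an image produces and the number of token pairs a full self-attention map would cover. The patch size of 16 is an assumption (it is not stated in the metadata above), and `sequence_costs` is a hypothetical helper. The numbers only illustrate quadratic versus linear growth in sequence length; they do not reproduce the reported 2.8× speedup or 86.8% memory saving.

```python
def sequence_costs(resolution, patch=16):
    """Return (number of tokens, number of token pairs) for a square image.

    Assumption: non-overlapping square patches of side `patch`.
    """
    tokens = (resolution // patch) ** 2
    pairs = tokens * tokens  # entries in one full self-attention map
    return tokens, pairs

print(sequence_costs(224))   # (196, 38416)
print(sequence_costs(1248))  # (6084, 37015056)
```

Under these assumptions, going from 224×224 to 1248×1248 multiplies the token count by about 31 but the number of attention pairs by about 960, while a recurrent scan visits each token once per direction and therefore grows with the token count alone.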

Submitted to arXiv on 17 Jan. 2024

Results of the summarizing process for the arXiv paper: 2401.09417v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

In recent years, state space models (SSMs) with efficient hardware-aware designs, such as Mamba, have shown promising results in long sequence modeling. Building on this, the authors propose Vision Mamba (Vim), an efficient and generic vision backbone built purely on SSMs. They argue that visual representation learning does not need to rely on self-attention, and Vim instead uses bidirectional Mamba blocks: it marks image sequences with position embeddings and compresses visual representations with bidirectional state space models. Vim achieves higher performance than well-established vision transformers such as DeiT on ImageNet classification, COCO object detection, and ADE20k semantic segmentation, while also being markedly more efficient in computation and memory. For instance, Vim is 2.8 times faster than DeiT and saves 86.8% of GPU memory when performing batch inference to extract features on 1248×1248 images. These results suggest that Vim can overcome the computation and memory constraints of Transformer-style understanding for high-resolution images. The authors provide code at https://github.com/hustvl/Vim. They present Vim as having strong potential to become the next-generation backbone for vision foundation models.
Created on 19 Jan. 2024
