Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

AI-generated keywords: NaViT, Vision Transformers, CNN-designed, Flexibility, Robustness

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The authors present NaViT (Native Resolution ViT), a Vision Transformer (ViT) that uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios.
  • NaViT departs from the standard input and modeling pipeline used by most computer vision models, a pipeline that was designed for CNNs.
  • NaViT improves training efficiency for large-scale supervised and contrastive image-text pretraining.
  • It can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation.
  • NaViT leads to improved results on robustness and fairness benchmarks.
  • The flexibility in handling varying input sequence lengths and resolutions makes NaViT a promising direction for ViTs.

Authors: Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby

Abstract: The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
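
The abstract mentions sequence packing without giving details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a patch size of 16, a greedy first-fit packing strategy, and a per-token attention mask that keeps packed images from attending to each other. All function names and parameters here are hypothetical.

```python
from typing import List

PATCH = 16  # assumed patch size; not stated in the source


def num_tokens(height: int, width: int, patch: int = PATCH) -> int:
    """Patch-token count for an image kept at its native resolution.

    Sides that are not multiples of `patch` are rounded up (assumes padding).
    """
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    return rows * cols


def pack_first_fit(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit: put each example in the first packed sequence with room.

    Returns one list of example indices per packed sequence.
    """
    packs: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each packed sequence
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"example {idx} has {n} tokens, max is {max_len}")
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(idx)
                free[p] -= n
                break
        else:  # no existing sequence had room, so open a new one
            packs.append([idx])
            free.append(max_len - n)
    return packs


def segment_ids(pack: List[int], lengths: List[int], max_len: int) -> List[int]:
    """Per-token example id for one packed sequence; -1 marks padding."""
    ids: List[int] = []
    for idx in pack:
        ids.extend([idx] * lengths[idx])
    ids.extend([-1] * (max_len - len(ids)))
    return ids


def attention_mask(ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True only if tokens i and j come from the same image."""
    return [[a == b and a != -1 for b in ids] for a in ids]


# Four images with different resolutions and aspect ratios (height, width).
sizes = [(224, 224), (128, 512), (64, 96), (320, 160)]
lengths = [num_tokens(h, w) for h, w in sizes]
packs = pack_first_fit(lengths, max_len=256)
ids = segment_ids(packs[0], lengths, max_len=256)
mask = attention_mask(ids)
```

With these inputs the small 64x96 image shares a sequence with the 224x224 image instead of being resized or padded to a full sequence of its own. The same token-count function also illustrates the test-time trade-off the abstract describes: feeding a lower-resolution image yields fewer tokens and therefore less compute.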

Submitted to arXiv on 12 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.06304v1

The authors of the paper (Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, and Neil Houlsby) present NaViT (Native Resolution ViT), a model that takes advantage of the flexible sequence-based modeling offered by Vision Transformers (ViTs) to process inputs of arbitrary resolutions and aspect ratios. This departs from the standard input and modeling pipeline used by most computer vision models, a pipeline that was designed for CNNs. NaViT is shown to improve training efficiency for large-scale supervised and contrastive image-text pretraining. Furthermore, it can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation, and it leads to improved results on robustness and fairness benchmarks. The flexibility in handling varying input sequence lengths and resolutions makes NaViT a promising direction for ViTs.
Created on 27 Jul. 2023
