Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

AI-generated keywords: Next-ViT NCB NTB Convolutional Layers Transformers

AI-generated Key Points

Most existing vision Transformers (ViTs) struggle to match the efficiency of convolutional neural networks (CNNs) in realistic industrial scenarios
Hybrid architectures combining CNNs and Transformers have not yielded satisfactory results
Next-ViT is a next-generation vision Transformer that outperforms both CNNs and ViTs in terms of latency/accuracy trade-off
Next-ViT introduces two key components: the Next Convolution Block (NCB) and the Next Transformer Block (NTB)
NCB captures local information using deployment-friendly mechanisms, while NTB captures global information and combines it with local information for enhanced performance
Next Hybrid Strategy (NHS) efficiently stacks NCB and NTB to improve performance across various vision tasks
Extensive experiments demonstrate that Next-ViT surpasses existing CNNs, ViTs, and hybrid architectures in terms of latency/accuracy trade-off on TensorRT and CoreML platforms
Next-ViT models are designed specifically for realistic industrial scenarios and offer improved efficiency compared to previous methods
Other hybrid architectures like BoTNet, CvT, CMT, Mobile-ViT, Mobile-Former, and EfficientFormer have been proposed in related work
Core designs within Next-ViT include NCB and NTB blocks for information interaction and modeling short-term/long-term dependencies in visual data
Fusion of local and global information is performed in the NTB block to enhance modeling capability
The proposed Next Hybrid Strategy integrates convolutional layers with Transformer blocks innovatively to overcome limitations posed by existing methods

Authors: Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan

arXiv: 2207.05501v4 (cs.CV)

License: CC BY 4.0

Abstract: Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) can not perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: Can a visual neural network be designed to infer as fast as CNNs and perform as powerful as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far away from satisfactory. To end these, we propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance with CSWin, while the inference speed is accelerated by 3.6x. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are made public at: https://github.com/bytedance/Next-ViT

Submitted to arXiv on 12 Jul. 2022

Results of the summarizing process for the arXiv paper: 2207.05501v4

In the field of computer vision, most existing vision Transformers (ViTs) struggle to match the efficiency of convolutional neural networks (CNNs) in realistic industrial scenarios. This has led researchers to explore hybrid architectures that combine CNNs and Transformers, but these approaches have not yielded satisfactory results. To address this challenge, the authors propose a next-generation vision Transformer called Next-ViT that outperforms both CNNs and ViTs in terms of latency/accuracy trade-off.

Next-ViT introduces two key components: the Next Convolution Block (NCB) and the Next Transformer Block (NTB). The NCB captures local information using deployment-friendly mechanisms, while the NTB captures global information and combines it with local information for enhanced performance. The authors also develop the Next Hybrid Strategy (NHS), which efficiently stacks NCB and NTB to improve performance across various vision tasks.

Extensive experiments demonstrate that Next-ViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-off. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP on COCO detection and 7.7% mIoU on ADE20K segmentation under similar latency. On CoreML, it surpasses EfficientFormer by 4.6 mAP and 3.5% mIoU on the same tasks under similar latency.

In related work, other hybrid architectures such as BoTNet, CvT, CMT, Mobile-ViT, Mobile-Former, and EfficientFormer have been proposed to combine convolutional layers with Transformers and leverage their respective strengths. The core designs within Next-ViT are the NCB and NTB blocks, which handle information interaction and model short-term and long-term dependencies in visual data. Local and global information is fused in the NTB to further enhance modeling capability, and the Next Hybrid Strategy integrates convolutional layers with Transformer blocks to overcome limitations of existing methods.

Overall, the proposed architecture offers a promising solution for efficient deployment in realistic industrial scenarios, because it outperforms existing CNNs, ViTs, and hybrid architectures on the latency/accuracy trade-off across various vision tasks. The code and models are publicly available for further exploration.

Most vision Transformers struggle to be as efficient as convolutional neural networks in real-life situations. Combining CNNs and Transformers hasn't worked well so far. Next-ViT is a new kind of vision Transformer that performs better than both CNNs and ViTs when considering the balance between speed and accuracy. It introduces two important components: the Next Convolution Block (NCB) and the Next Transformer Block (NTB). NCB captures local information using methods that run efficiently on deployment platforms, while NTB captures global information and combines it with local information for better performance.

Exploring Next-ViT: A Next-Generation Vision Transformer for Realistic Industrial Scenarios

Computer vision has grown rapidly in recent years, and with it the need for efficient architectures that can be deployed at scale. Convolutional neural networks (CNNs) have traditionally been used for this purpose and remain fast at inference, but they do not always match the accuracy of newer models. This has led researchers to explore alternative architectures such as Transformers, which offer improved performance but struggle to match CNNs in terms of latency/accuracy trade-off. To address this challenge, a team of researchers at ByteDance proposed a next-generation vision Transformer called Next-ViT that outperforms both CNNs and ViTs in terms of latency/accuracy trade-off.

Background on Vision Transformers

Vision Transformers (ViTs) are a class of models based on the Transformer architecture originally developed for natural language processing tasks. Whereas CNNs build features from convolutional layers, whose learned filters each look at a small local neighborhood, ViTs use self-attention to relate every part of the input to every other part, with fewer built-in assumptions about its structure. As a result, ViTs capture long-term dependencies more effectively than CNNs. However, this advantage comes at the cost of increased complexity and difficulty in deployment: the attention mechanisms and model design are complex, and standard self-attention compares every token with every other token, so its cost grows quadratically with the number of tokens.
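To make the idea of self-attention and its cost concrete, here is a minimal sketch in plain Python. It is a simplification and not the attention used in Next-ViT: queries, keys, and values are all the raw tokens (no learned projections, single head), and the helper names `num_tokens` and `self_attention` are hypothetical.

```python
import math

def num_tokens(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    per_side = image_size // patch_size
    return per_side * per_side

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (assumption:
    no learned projections). Each output is a weighted mix of ALL tokens."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # One score per (query, key) pair: N tokens produce N * N scores in total.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens))
                        for j in range(dim)])
    return outputs

if __name__ == "__main__":
    for size in (224, 448):
        n = num_tokens(size, 16)
        print(f"{size}x{size} image, 16x16 patches: {n} tokens, {n * n} attention scores")
    print(self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Doubling the image side from 224 to 448 quadruples the token count (196 to 784) and multiplies the number of pairwise scores by 16, which is one reason plain attention is harder to deploy at low latency than a convolution with a fixed local window.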

The Proposed Architecture: Next-ViT

To overcome these challenges, the authors propose a new hybrid architecture called Next-ViT that combines two key components: the Next Convolution Block (NCB) and the Next Transformer Block (NTB). The NCB captures local information using deployment-friendly mechanisms, while the NTB captures global information and combines it with local information for enhanced performance. Additionally, they introduce the Next Hybrid Strategy (NHS), a scheme for stacking NCB and NTB blocks (an architectural choice, not a training procedure) that improves performance across various vision tasks while keeping latency low enough for industrial scenarios.
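As a rough illustration of what "stacking" NCB and NTB blocks could look like, the sketch below lays out block names per stage. The pattern used here (several NCBs followed by one NTB, repeated within a stage) and the stage sizes are assumptions for illustration only; this summary does not state the exact layout, and `build_stage` and `build_backbone` are hypothetical helpers, not functions from the Next-ViT repository.

```python
def build_stage(num_ncb, num_repeats):
    """Block names for one stage under the ASSUMED pattern:
    (NCB x num_ncb, then NTB x 1), repeated num_repeats times."""
    return (["NCB"] * num_ncb + ["NTB"]) * num_repeats

def build_backbone(stage_specs):
    """Build every stage from (num_ncb, num_repeats) pairs."""
    return [build_stage(num_ncb, num_repeats) for num_ncb, num_repeats in stage_specs]

if __name__ == "__main__":
    # Hypothetical stage sizes, chosen only to show the shape of the layout.
    backbone = build_backbone([(2, 1), (3, 2)])
    for index, stage in enumerate(backbone, start=1):
        print(f"stage {index}: {' -> '.join(stage)}")
```

The point of such a layout is that cheap local blocks do most of the work, while an occasional global block mixes information across the whole feature map.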

Experimental Results

Extensive experiments were conducted on two deployment platforms. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and by 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency; it also achieves performance comparable to CSWin while running 3.6x faster. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and by 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Overall, the proposed architecture offers a promising solution for efficient deployment in realistic industrial scenarios, because it outperforms existing CNNs, ViTs, and hybrid architectures on the latency/accuracy trade-off across various vision tasks. The code and models are publicly available for further exploration.
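The gains quoted in the abstract can be checked with a few lines of arithmetic. The baseline and Next-ViT scores below are copied from the abstract; the `gain` helper and the dictionary layout are just one convenient way to organize them.

```python
def gain(baseline, improved):
    """Absolute improvement in points, rounded to one decimal place."""
    return round(improved - baseline, 1)

# (platform, task and metric) -> (baseline score, Next-ViT score), from the abstract.
# TensorRT baseline is ResNet; CoreML baseline is EfficientFormer.
results = {
    ("TensorRT", "COCO detection mAP"): (40.4, 45.9),
    ("TensorRT", "ADE20K segmentation mIoU %"): (38.8, 46.5),
    ("CoreML", "COCO detection mAP"): (42.6, 47.2),
    ("CoreML", "ADE20K segmentation mIoU %"): (45.1, 48.6),
}

if __name__ == "__main__":
    for (platform, metric), (baseline, improved) in results.items():
        print(f"{platform}, {metric}: {baseline} -> {improved} (+{gain(baseline, improved)})")
```

All four comparisons are stated at similar latency, so the differences are accuracy gains at roughly equal inference cost.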

Conclusion

In conclusion, the paper presents an approach to combining convolutional layers with Transformers through two core designs, the NCB and NTB blocks, together with the NHS stacking strategy. Extensive experiments show that the proposed architecture achieves a better latency/accuracy trade-off than existing methods, especially in realistic industrial scenarios where low latency is required. With code and models publicly available, this work offers a useful reference for developing future computer vision applications.

Created on 04 Dec. 2023
