Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios
AI-generated Key Points
- Most existing vision Transformers (ViTs) struggle to match the efficiency of convolutional neural networks (CNNs) in realistic industrial scenarios
- Hybrid architectures combining CNNs and Transformers have not yielded satisfactory results
- Next-ViT is a next-generation vision Transformer that outperforms both CNNs and ViTs in terms of latency/accuracy trade-off
- Next-ViT introduces two key components: the Next Convolution Block (NCB) and the Next Transformer Block (NTB)
- NCB captures local information using deployment-friendly mechanisms, while NTB captures global information and combines it with local information for enhanced performance
- Next Hybrid Strategy (NHS) efficiently stacks NCB and NTB to improve performance across various vision tasks
- Extensive experiments demonstrate that Next-ViT surpasses existing CNNs, ViTs, and hybrid architectures in terms of latency/accuracy trade-off on TensorRT and CoreML platforms
- Next-ViT models are designed specifically for realistic industrial scenarios and offer improved efficiency compared to previous methods
- Other hybrid architectures like BoTNet, CvT, CMT, Mobile-ViT, Mobile-Former, and EfficientFormer have been proposed in related work
- Core designs within Next-ViT include NCB and NTB blocks for information interaction and modeling short-term/long-term dependencies in visual data
- Fusion of local and global information is performed in the NTB block to enhance modeling capability
- The proposed Next Hybrid Strategy stacks convolution blocks (NCB) and Transformer blocks (NTB) in a single hybrid design intended to overcome the limitations of earlier CNN-Transformer hybrids
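The stacking idea in the key points above can be pictured with a small layout sketch. This is only an illustration under stated assumptions: the summary says NHS stacks NCB and NTB but gives no ratios, so the pattern used here (a few NCBs followed by one NTB, repeated within a stage), the function name `hybrid_stage`, and all block counts are hypothetical and are not taken from the paper's configurations.

```python
# Illustrative sketch only: the stacking pattern and the block counts below
# are assumptions for demonstration, not the authors' published settings.

def hybrid_stage(num_ncb, num_groups):
    """Return the block order for one stage.

    Each group is `num_ncb` convolution blocks (local information)
    followed by a single Transformer block (global information).
    """
    group = ["NCB"] * num_ncb + ["NTB"]
    return group * num_groups


def build_layout(stage_configs):
    """Build a per-stage layout from (num_ncb, num_groups) pairs."""
    return [hybrid_stage(n, g) for n, g in stage_configs]


# Hypothetical four-stage network; the numbers are placeholders.
layout = build_layout([(3, 1), (3, 1), (4, 2), (2, 1)])
for index, stage in enumerate(layout, start=1):
    print(f"stage {index}: {' -> '.join(stage)}")
```

Keeping most blocks convolutional and placing a Transformer block only at the end of each group is one way to bound the number of attention layers, which is the kind of trade-off a latency-oriented hybrid has to make.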
Authors: Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan
Abstract: Due to their complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet their overall performance is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves performance comparable to CSWin while running inference 3.6x faster. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are made public at: https://github.com/bytedance/Next-ViT
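As a quick sanity check, the gains quoted in the abstract can be recomputed from the before/after figures it reports. The sketch below uses only numbers stated in the abstract; the dictionary layout and the helper name `gain` are illustrative choices, not part of the paper or its code release.

```python
# Recompute the gains quoted in the abstract from its before/after numbers.
# Keys are (platform, benchmark metric, baseline model);
# values are (baseline score, Next-ViT score, gain quoted in the abstract).
reported = {
    ("TensorRT", "COCO mAP", "ResNet"): (40.4, 45.9, 5.5),
    ("TensorRT", "ADE20K mIoU", "ResNet"): (38.8, 46.5, 7.7),
    ("CoreML", "COCO mAP", "EfficientFormer"): (42.6, 47.2, 4.6),
    ("CoreML", "ADE20K mIoU", "EfficientFormer"): (45.1, 48.6, 3.5),
}


def gain(baseline, next_vit):
    """Absolute improvement, rounded to one decimal as in the abstract."""
    return round(next_vit - baseline, 1)


for (platform, metric, model), (base, ours, quoted) in reported.items():
    computed = gain(base, ours)
    status = "matches" if computed == quoted else "differs from"
    print(f"{platform} {metric} vs {model}: +{computed} ({status} the quoted +{quoted})")
```

All four differences agree with the figures quoted in the abstract. The mIoU gains are absolute percentage points, not relative improvements.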