Scale-Aware Modulation Meet Transformer

AI-generated keywords: Scale-Aware Modulation Transformer (SMT)

AI-generated Key Points

  • Scale-Aware Modulation Transformer (SMT) combines convolutional networks and vision Transformers for efficient handling of downstream tasks
  • SMT introduces the Multi-Head Mixed Convolution (MHMC) module to capture multi-scale features and expand the receptive field
  • SMT also includes the Scale-Aware Aggregation (SAA) module for lightweight but effective information fusion across different heads
  • Evolutionary Hybrid Network (EHN) simulates the shift from capturing local to global dependencies as network depth increases, resulting in superior performance
  • SMT outperforms existing state-of-the-art models across various visual tasks
  • SMT reaches 82.2% and 84.3% top-1 accuracy on ImageNet-1K at 11.5M / 2.4 GFLOPs and 32M / 7.7 GFLOPs, respectively
  • After pretraining on ImageNet-22K at 224^2 resolution, SMT reaches 87.1% and 88.1% top-1 accuracy when finetuned at 224^2 and 384^2, respectively
  • With Mask R-CNN on COCO, the SMT base outperforms its Swin Transformer counterpart by 4.2 mAP (1x schedule) and 1.3 mAP (3x schedule)

Authors: Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, Lianwen Jin

Accepted to ICCV 2023
License: CC BY 4.0

Abstract: This paper presents a new vision Transformer, Scale-Aware Modulation Transformer (SMT), that can handle various downstream tasks efficiently by combining convolutional networks and vision Transformers. The proposed Scale-Aware Modulation (SAM) in the SMT includes two primary novel designs. Firstly, we introduce the Multi-Head Mixed Convolution (MHMC) module, which can capture multi-scale features and expand the receptive field. Secondly, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight but effective, enabling information fusion across different heads. By leveraging these two modules, convolutional modulation is further enhanced. Furthermore, in contrast to prior works that utilized modulations throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which can effectively simulate the shift from capturing local to global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M / 2.4 GFLOPs and 32M / 7.7 GFLOPs can achieve 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After being pretrained on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when finetuned at resolutions 224^2 and 384^2, respectively. For object detection with Mask R-CNN, the SMT base trained with 1x and 3x schedules outperforms the Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, the SMT base tested at single- and multi-scale surpasses Swin by 2.0 and 1.1 mIoU, respectively, on ADE20K.
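
To make the two SAM modules more concrete, here is a minimal pure-Python sketch of the idea as the abstract describes it: channels are split into heads, each head applies a depth-wise convolution with its own kernel size (multi-scale features), and the head outputs are then regrouped so that every group mixes all scales. Everything specific is an assumption made for illustration and is not taken from the abstract: the kernel sizes (3, 5, 7, 9), the fixed averaging kernels standing in for learned weights, the function names, and the reduction of each group to a single fused channel.

```python
from typing import List, Sequence

Channel = List[List[float]]  # one H x W feature map


def box_kernel(k: int) -> List[List[float]]:
    """Fixed k x k averaging kernel; a stand-in for learned depth-wise weights."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def depthwise_conv(ch: Channel, kernel: List[List[float]]) -> Channel:
    """Zero-padded 'same' 2-D correlation of one channel with an odd-sized kernel."""
    h, w, k = len(ch), len(ch[0]), len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < h and 0 <= x < w:  # outside the map counts as zero
                        acc += ch[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def mixed_conv_heads(
    x: List[Channel], kernel_sizes: Sequence[int] = (3, 5, 7, 9)
) -> List[List[Channel]]:
    """Split channels evenly into heads; head i uses kernel_sizes[i] (assumed sizes)."""
    n_heads = len(kernel_sizes)
    if len(x) % n_heads != 0:
        raise ValueError("channel count must be divisible by the number of heads")
    per_head = len(x) // n_heads
    heads = []
    for idx, k in enumerate(kernel_sizes):
        chunk = x[idx * per_head:(idx + 1) * per_head]
        heads.append([depthwise_conv(c, box_kernel(k)) for c in chunk])
    return heads


def aggregate_across_heads(
    heads: List[List[Channel]], weights: Sequence[float]
) -> List[Channel]:
    """Group g takes the g-th channel of every head; a weighted sum over the
    group (a 1x1 convolution restricted to that group) fuses the scales.
    Simplification: each group is reduced to a single output channel."""
    per_head = len(heads[0])
    h, w = len(heads[0][0]), len(heads[0][0][0])
    fused = []
    for g in range(per_head):
        group = [head[g] for head in heads]  # one channel per scale
        fused.append([
            [sum(wt * ch[i][j] for wt, ch in zip(weights, group)) for j in range(w)]
            for i in range(h)
        ])
    return fused


# Usage: 8 channels of a 5 x 5 map, 4 heads with 2 channels each.
x = [[[1.0] * 5 for _ in range(5)] for _ in range(8)]
heads = mixed_conv_heads(x)
fused = aggregate_across_heads(heads, weights=[0.25] * 4)
```

In a real network the kernels and fusion weights would be learned, and the fused map would modulate a value branch; the sketch only illustrates why grouping one channel from each head lets a cheap point-wise operation combine information from several receptive-field sizes.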

Submitted to arXiv on 17 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.08579v1

This paper introduces a new vision Transformer called Scale-Aware Modulation Transformer (SMT) that combines convolutional networks and vision Transformers to efficiently handle various downstream tasks. The proposed Scale-Aware Modulation (SAM) in SMT consists of two primary novel designs. Firstly, the Multi-Head Mixed Convolution (MHMC) module is introduced to capture multi-scale features and expand the receptive field. Secondly, the Scale-Aware Aggregation (SAA) module enables lightweight but effective information fusion across different heads. Together, these modules enhance convolutional modulation. In contrast to previous works that utilized modulations throughout all stages to build an attention-free network, this paper proposes an Evolutionary Hybrid Network (EHN). EHN effectively simulates the shift from capturing local to global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT outperforms existing state-of-the-art models across a wide range of visual tasks. For instance, SMT achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K with 11.5M / 2.4 GFLOPs and 32M / 7.7 GFLOPs, respectively. After being pretrained on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when finetuned at resolutions of 224^2 and 384^2, respectively. For object detection using Mask R-CNN on the COCO dataset, the SMT base outperforms its Swin Transformer counterpart by 4.2 mAP with a 1x training schedule and by 1.3 mAP with a 3x schedule.
Created on 26 Jul. 2023
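
The summary describes the Evolutionary Hybrid Network only at a high level: modulation handles local dependencies early on, and the network shifts toward global dependencies as it gets deeper. One plausible way to picture that shift is a per-stage block schedule. The sketch below is an assumption for illustration, not the paper's confirmed layout: the block labels, the function name `hybrid_schedule`, the choice of alternating blocks in a single transition stage, and the example depths are all hypothetical.

```python
from typing import List, Sequence


def hybrid_schedule(depths: Sequence[int], hybrid_stage: int) -> List[List[str]]:
    """Return a block type for every layer of every stage.

    Stages before `hybrid_stage` use only convolutional modulation (local),
    stages after it use only self-attention (global), and the hybrid stage
    alternates the two. This layout is an illustrative assumption.
    """
    schedule = []
    for stage, depth in enumerate(depths):
        if stage < hybrid_stage:
            blocks = ["modulation"] * depth
        elif stage == hybrid_stage:
            blocks = ["modulation" if i % 2 == 0 else "attention" for i in range(depth)]
        else:
            blocks = ["attention"] * depth
        schedule.append(blocks)
    return schedule


# Usage with hypothetical depths: four stages, the third one is the hybrid stage.
for stage_idx, blocks in enumerate(hybrid_schedule(depths=(2, 2, 4, 1), hybrid_stage=2)):
    print(f"stage {stage_idx + 1}: {blocks}")
```

Contrast this with the attention-free designs the summary mentions, which would correspond to a schedule containing only "modulation" entries in every stage.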
