Token Merging: Your ViT But Faster

AI-generated keywords: Token Merging, Vision Transformer, AugReg, MAE, SWAG

AI-generated Key Points

  • Token Merging (ToMe) enhances the throughput of Vision Transformer (ViT) models without additional training.
  • ToMe merges similar tokens using a lightweight matching algorithm.
  • In off-the-shelf experiments, ToMe doubles the throughput of ViT-L @ 512 and ViT-H @ 518 models on images.
  • Off-the-shelf, ToMe also gives 2.2x the throughput of ViT-L on video; the accuracy drop is only 0.2-0.3% in both the image and video cases.
  • ToMe can be applied during training, resulting in up to 2x speed improvements for MAE fine-tuning on video.
  • Training with ToMe further reduces the accuracy drop, giving 2x the throughput of ViT-B on audio for only a 0.4% mAP drop.
  • ToMe can merge object parts into one token across multiple frames of video.
  • Experiments on ImageNet-1k show consistent throughput improvements with little loss in accuracy when ToMe is applied to ViT models trained with different methods.
  • Ablation studies show that merging after attention, using the attention keys (K) rather than the token features (X) to measure similarity, improves accuracy significantly.
  • Gradually merging over multiple layers yields optimal results according to experimentation with different merge schedules.
  • Token Merging provides an efficient solution for increasing the throughput of ViT models without requiring additional training efforts.
  • Code for implementing ToMe is available on the authors' GitHub repository.

Authors: Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

Preprint. Code will be available here: https://github.com/facebookresearch/ToMe
License: CC BY 4.0

Abstract: We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
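
The matching step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes an alternating split of the tokens into two sets, cosine similarity between attention keys, and plain averaging of merged tokens. The names `cosine` and `merge_tokens` are hypothetical, and details such as protecting the class token or weighting by token size are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, keys, r):
    """Remove up to r tokens by merging each into its most similar partner.

    Hypothetical helper for illustration only. `tokens` are feature vectors;
    `keys` are the vectors used only to measure similarity.
    """
    # Assumption: split tokens alternately into set A (even) and set B (odd).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if r <= 0 or not b_idx:
        return [list(t) for t in tokens]

    # For every token in A, find its single most similar token in B.
    edges = []
    for i in a_idx:
        j = max(b_idx, key=lambda b: cosine(keys[i], keys[b]))
        edges.append((cosine(keys[i], keys[j]), i, j))

    # Keep only the r most similar A-to-B edges; those A tokens get merged.
    edges.sort(key=lambda e: e[0], reverse=True)
    groups = {j: [tokens[j]] for j in b_idx}
    removed = set()
    for _, i, j in edges[:r]:
        groups[j].append(tokens[i])
        removed.add(i)

    # Rebuild the sequence: each merged B token becomes the mean of its group.
    merged = []
    for idx, tok in enumerate(tokens):
        if idx in removed:
            continue
        members = groups.get(idx, [tok])
        merged.append([sum(col) / len(members) for col in zip(*members)])
    return merged


# Toy usage: four 2-D tokens, reusing the features as keys for the demo.
tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 0.9]]
reduced = merge_tokens(tokens, tokens, r=1)
print(len(reduced))  # 3
```

Because each token in the first set proposes only one partner, the cost is a single similarity comparison between the two halves rather than an iterative clustering, which is in keeping with the abstract's description of the matching as light-weight.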

Submitted to arXiv on 17 Oct. 2022

Results of the summarizing process for the arXiv paper: 2210.09461v1

Token Merging (ToMe) increases the throughput of existing Vision Transformer (ViT) models without additional training. It gradually merges similar tokens inside the transformer using a lightweight matching algorithm that is as fast as pruning while being more accurate. Applied off-the-shelf, ToMe doubles the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and gives 2.2x the throughput of ViT-L on video, with an accuracy drop of only 0.2-0.3% in each case.

ToMe can also be applied during training, speeding up MAE fine-tuning on video by up to 2x. Training with ToMe further reduces the accuracy drop, giving 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, ToMe merges object parts into a single token, even across multiple frames of video.

On ImageNet-1k, the authors apply ToMe off-the-shelf to ViT models trained with AugReg, MAE, SWAG, and DeiT, and observe consistent throughput improvements with little loss in accuracy. Their ablation studies of the design choices find that merging after attention, using the attention keys (K) rather than the token features (X) to measure similarity, significantly improves accuracy, and that merging gradually over multiple layers gives the best results among the merge schedules tested.

Overall, Token Merging is an efficient way to increase the throughput of ViT models without additional training, with accuracy and speed competitive with the state of the art. The authors provide code for ToMe in their GitHub repository.
Created on 19 Aug. 2023
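
To make "merging gradually over multiple layers" concrete, here is a small sketch of a constant schedule that merges the same number of tokens, r, in every layer. The numbers are illustrative assumptions: 197 tokens and 12 layers correspond to a standard ViT-B/16 at 224x224 resolution, and r = 8 is chosen arbitrarily. `token_counts` is a hypothetical helper, and the cap at half the tokens assumes a pairwise matching between two halves of the token set.

```python
def token_counts(num_tokens, num_layers, r):
    """Return the number of tokens left after each layer under a constant schedule."""
    counts = []
    n = num_tokens
    for _ in range(num_layers):
        # Assumption: matching two halves of the tokens can remove at most half of them.
        n -= min(r, n // 2)
        counts.append(n)
    return counts


# Illustrative setting (assumed, not taken from the summary above).
counts = token_counts(197, 12, 8)
print(counts[-1])  # 101
```

Later layers therefore process progressively fewer tokens, which is where the throughput gain comes from; a schedule that removed the same total number of tokens in a single layer would reach the same final count without spreading the merging out.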
