Token Merging: Your ViT But Faster

AI-generated keywords: Token Merging, Vision Transformer, AugReg, MAE, SWAG

AI-generated Key Points

Token Merging (ToMe) increases the throughput of existing Vision Transformer (ViT) models without additional training.
ToMe gradually merges similar tokens using a general, lightweight matching algorithm that is as fast as pruning while being more accurate.
Off the shelf, ToMe doubles the throughput of ViT-L @ 512 and ViT-H @ 518 models on images with an accuracy drop of only 0.2-0.3%.
On video, ToMe gives 2.2x the throughput of ViT-L, again with an accuracy drop of only 0.2-0.3%.
ToMe can also be applied during training, speeding up MAE fine-tuning on video by up to 2x.
Training with ToMe further reduces the accuracy drop, giving 2x the throughput of ViT-B on audio for a 0.4% mAP drop.
Qualitatively, ToMe merges object parts into one token, even across multiple frames of video.
On ImageNet-1k, applying ToMe to ViT models trained with AugReg, MAE, SWAG, and DeiT consistently improves throughput at little cost in accuracy.
Ablation studies show that merging after attention, using attention keys (K) rather than token features (X), significantly improves accuracy.
Among the merge schedules tested, merging gradually over multiple layers gives the best results.
Code for ToMe is available in the authors' GitHub repository.

Authors: Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

arXiv: 2210.09461v1 (cs.CV)

Preprint. Code will be available here: https://github.com/facebookresearch/ToMe

License: CC BY 4.0

Abstract: We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.

Submitted to arXiv on 17 Oct. 2022

Results of the summarizing process for the arXiv paper: 2210.09461v1

Comprehensive Summary

Token Merging (ToMe) is a method for increasing the throughput of existing Vision Transformer (ViT) models without additional training. It gradually merges similar tokens in a transformer using a general, lightweight matching algorithm that is as fast as pruning while being more accurate. Off the shelf, ToMe doubles the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and gives 2.2x the throughput of ViT-L on video, with an accuracy drop of only 0.2-0.3% in each case. ToMe can also be applied during training, speeding up MAE fine-tuning on video by up to 2x. Training with ToMe further reduces the accuracy drop, giving 2x the throughput of ViT-B on audio for a 0.4% mAP drop. Qualitatively, ToMe merges object parts into one token, even across multiple frames of video.

The authors run experiments on ImageNet-1k with ViT models trained using AugReg, MAE, SWAG, and DeiT, comparing each off-the-shelf model with and without ToMe, and observe consistent throughput improvements at little cost in accuracy. Their ablation studies find that merging after attention, using attention keys (K) instead of token features (X), significantly improves accuracy, and that among the merge schedules tested, merging gradually over multiple layers gives the best results. Overall, ToMe's accuracy and speed are competitive with the state of the art on images, video, and audio. The authors provide code for ToMe in their GitHub repository.

Layman's Summary

1. Token Merging (ToMe) makes Vision Transformer (ViT) models faster without needing more training.
2. ToMe combines similar tokens using a simple matching algorithm.
3. In tests, ToMe doubles the speed of ViT-L @ 512 and ViT-H @ 518 models on pictures.
4. ToMe also makes models faster on videos, with only a small drop in accuracy.
5. ToMe can be used during training to make fine-tuning on video faster.

Definitions

- Token Merging (ToMe): A method that combines similar parts of information to make computer models work faster.
- Vision Transformer (ViT): A type of computer model that can understand and analyze images or videos.
- Throughput: How quickly a computer model can process information.
- Algorithm: A set of instructions for solving a problem or completing a task.
- Accuracy: How close a result is to the truth or the desired result.
- Training: The process of teaching a computer model how to do something by showing it examples and giving it feedback.
- Fine-tuning: Making small adjustments to improve the performance of a computer model after it has been trained.
- Object parts: Different pieces or sections of an object, like the wheels and body of a car.
- Frames: Individual pictures that make up a video when played in sequence.
- ImageNet-1k: A large dataset commonly used for testing image recognition algorithms and models.
- Ablation studies: Experiments where certain

Introducing Token Merging (ToMe): A Method for Enhancing Vision Transformer (ViT) Throughput

The advent of deep learning has enabled the development of powerful models that can learn complex tasks from data. One such model is the Vision Transformer (ViT), which has become increasingly popular in recent years due to its ability to process large-scale visual data efficiently. However, ViT models are often limited by their throughput, meaning they may not be able to process large amounts of data quickly enough for certain applications. To address this issue, researchers have recently introduced a new method called Token Merging (ToMe), which can enhance ViT throughput without requiring additional training efforts.

How Does Token Merging Work?

Token Merging gradually merges similar tokens in a transformer using a general, lightweight matching algorithm that is as fast as pruning while being more accurate. Each merge reduces the number of tokens that later layers have to process, which is where the speedup comes from. Off the shelf, ToMe doubles the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and gives 2.2x the throughput of ViT-L on video, with an accuracy drop of only 0.2–0.3% in each case. It can also be applied during training, speeding up MAE fine-tuning on video by up to 2x; training with ToMe further reduces the accuracy drop.
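
This summary does not spell out the matching algorithm, so the following is only a minimal sketch of one way to merge similar tokens, under stated assumptions: tokens are split into two alternating sets, each token in the first set proposes its most similar token in the second set by cosine similarity, and the r highest-scoring pairs are merged by unweighted averaging. The names `cosine`, `merge_tokens`, and `r` are illustrative and are not taken from the authors' code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, r):
    """Return a shorter token list in which up to r similar pairs are merged.

    Assumptions (illustrative, not from the summary): even-indexed tokens
    form the source set, odd-indexed tokens form the destination set, and
    merged tokens are plain unweighted means.
    """
    if r <= 0 or len(tokens) < 2:
        return list(tokens)
    src, dst = tokens[0::2], tokens[1::2]
    r = min(r, len(src))

    # Each source token proposes its single most similar destination token.
    proposals = []
    for i, s in enumerate(src):
        scores = [cosine(s, d) for d in dst]
        j = max(range(len(dst)), key=scores.__getitem__)
        proposals.append((scores[j], i, j))

    # Keep only the r highest-scoring proposals.
    proposals.sort(reverse=True)
    selected = proposals[:r]
    merged_src = {i for _, i, _ in selected}

    # Group the selected sources with their destination, then average each group.
    groups = {j: [d] for j, d in enumerate(dst)}
    for _, i, j in selected:
        groups[j].append(src[i])
    new_dst = [[sum(col) / len(g) for col in zip(*g)]
               for _, g in sorted(groups.items())]

    # Unmerged sources come first, followed by the (possibly merged) destinations.
    kept_src = [s for i, s in enumerate(src) if i not in merged_src]
    return kept_src + new_dst


tokens = [[1, 0], [1, 0], [0, 1], [-1, 0]]
merged = merge_tokens(tokens, r=1)  # the two identical [1, 0] tokens are merged
```

Each call removes up to r tokens, so calling a function like this once per transformer layer shrinks the sequence gradually. The sketch gives no special treatment to a class token and does not preserve token order.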

Experiments and Results

The authors ran experiments on ImageNet-1k with ViT models trained using AugReg, MAE, SWAG, and DeiT, comparing each off-the-shelf model with and without ToMe, and observed consistent throughput improvements at little cost in accuracy. Their ablation studies found that merging after attention, using attention keys (K) instead of token features (X), significantly improved accuracy, and that among the merge schedules tested, merging gradually over multiple layers gave the best results. Qualitatively, ToMe merged object parts into one token, even across multiple frames of video, which suggests potential for applications such as action recognition or tracking objects over time.
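
To make the idea of a gradual merge schedule concrete, here is a small arithmetic sketch. It assumes a constant schedule that removes a fixed number of tokens `r` in every layer, and it uses the sum of squared token counts as a rough proxy for self-attention cost. The function names, the value of `r`, the 197-token input (a 224-pixel image with 16-pixel patches plus one class token), and the 12-layer depth are all illustrative assumptions, not figures from the paper, and the proxy is not a measured throughput.

```python
def token_schedule(num_tokens, num_layers, r):
    """Token count entering each layer when up to r tokens are merged per layer."""
    counts = []
    n = num_tokens
    for _ in range(num_layers):
        counts.append(n)
        # Assumption: one merge step can remove at most half of the tokens,
        # and the sequence never shrinks below a single token.
        n = max(1, n - min(r, n // 2))
    return counts


def relative_attention_cost(counts):
    """Sum of n^2 over all layers, divided by the cost with no merging."""
    baseline = counts[0] ** 2 * len(counts)
    return sum(n * n for n in counts) / baseline


# Hypothetical setting: 196 patches + 1 class token, 12 layers, r = 8.
counts = token_schedule(num_tokens=197, num_layers=12, r=8)
print(counts[0], counts[-1])            # 197 109
print(relative_attention_cost(counts))  # below 1.0, since later layers see fewer tokens
```

Because the token count drops a little in every layer, the savings accumulate with depth while no single layer loses many tokens at once.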

Conclusion

Overall, Token Merging is an efficient way to increase the throughput of ViT models without additional training, with accuracy and speed competitive with the state of the art on images, video, and audio. The authors provide code for ToMe in their GitHub repository, which makes the method easy to try and build on.

Created on 19 Aug. 2023
