Token Merging (ToMe) is a method for increasing the throughput of existing Vision Transformer (ViT) models without additional training. ToMe gradually merges similar tokens in a transformer using a lightweight matching algorithm that is both fast and accurate. Off the shelf, ToMe doubles the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and achieves 2.2x higher throughput on video, with an accuracy drop of only 0.2-0.3% in each case. ToMe can also be applied during training, speeding up MAE fine-tuning on video by up to 2x. Training with ToMe further reduces the accuracy drop, giving 2x the throughput of ViT-B on audio with an mAP drop of only 0.4%. Notably, ToMe merges object parts into a single token, even across multiple frames of video. The authors run several experiments on ImageNet-1k using ViT models trained with AugReg, MAE, SWAG, and DeiT, and compare off-the-shelf performance with and without ToMe applied; they observe consistent improvements in throughput with minimal loss of accuracy. For design choices, their ablation studies find that merging after attention, using attention keys (K) instead of token features (X), significantly improves accuracy. Experiments with different merge schedules show that merging gradually over multiple layers gives the best results. Overall, Token Merging is an efficient way to increase the throughput of ViT models without additional training, while remaining competitive in accuracy and speed. The authors provide code for ToMe in their GitHub repository, which makes the method easy to implement and explore further.
- Token Merging (ToMe) enhances the throughput of Vision Transformer (ViT) models without additional training.
- ToMe merges similar tokens using a lightweight matching algorithm.
- In off-the-shelf experiments, ToMe doubles the throughput of ViT-L @ 512 and ViT-H @ 518 models on images.
- ToMe achieves 2.2x higher throughput on video with a marginal accuracy drop of 0.2-0.3%.
- ToMe can be applied during training, resulting in up to 2x speed improvements for MAE fine-tuning on video.
- Training with ToMe minimizes accuracy drop and leads to 2x higher throughput than ViT-B on audio with a slight mAP drop of 0.4%.
- ToMe can merge object parts into one token across multiple frames of video.
- Experiments on ImageNet-1k show consistent improvements in throughput without compromising accuracy when applying ToMe to different ViT models trained through various methods.
- Ablation studies show that merging operations after attention using attention keys (K) improves accuracy significantly.
- Gradually merging over multiple layers yields optimal results according to experimentation with different merge schedules.
- Token Merging provides an efficient solution for increasing the throughput of ViT models without requiring additional training efforts.
- Code for implementing ToMe is available on the authors' GitHub repository.
Summary
1. Token Merging (ToMe) makes Vision Transformer (ViT) models faster without needing more training.
2. ToMe combines similar tokens using a simple matching algorithm.
3. In tests, ToMe doubles the speed of ViT-L @ 512 and ViT-H @ 518 models on pictures.
4. ToMe also makes models 2.2x faster on videos, with a small accuracy drop of 0.2-0.3%.
5. ToMe can be used during training to make MAE fine-tuning on video up to 2x faster.
Definitions
- Token Merging (ToMe): A method that combines similar parts of information to make computer models work faster.
- Vision Transformer (ViT): A type of computer model that can understand and analyze images or videos.
- Throughput: How quickly a computer model can process information.
- Algorithm: A set of instructions for solving a problem or completing a task.
- Accuracy: How close a model's results are to the truth or desired result.
- Training: The process of teaching a computer model how to do something by showing it examples and giving it feedback.
- Fine-tuning: Making small adjustments to improve the performance of a computer model after it has been trained.
- Object parts: Different pieces or sections of an object, like the wheels and body of a car.
- Frames: Individual pictures that make up a video when played in sequence.
- ImageNet-1k: A large dataset commonly used for testing image recognition algorithms and models.
- Ablation studies: Experiments where certain
Introducing Token Merging (ToMe): A Method for Enhancing Vision Transformer (ViT) Throughput
The advent of deep learning has enabled the development of powerful models that can learn complex tasks from data. One such model is the Vision Transformer (ViT), which has become increasingly popular in recent years due to its ability to process large-scale visual data efficiently. However, ViT models are often limited by their throughput, meaning they may not be able to process large amounts of data quickly enough for certain applications. To address this issue, researchers have recently introduced a new method called Token Merging (ToMe), which can enhance ViT throughput without requiring additional training efforts.
How Does Token Merging Work?
Token Merging works by gradually merging similar tokens in a transformer using a lightweight matching algorithm that is both fast and accurate. This allows the model to reduce the number of tokens it needs to process while still maintaining accuracy. In off-the-shelf experiments, ToMe was shown to double the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images, as well as achieving 2.2x higher throughput on video with only a marginal accuracy drop of 0.2–0.3%. It can also be applied during training, resulting in significant speed improvements of up to 2x for MAE fine-tuning on video with minimal accuracy drop compared to baseline ViTs trained without ToMe.
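The merging step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes tokens are split into two alternating sets, that similarity is cosine similarity, and that merged tokens are combined by averaging. The names `cosine`, `merge_tokens`, `r`, and `metric` are hypothetical and chosen only for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_tokens(tokens, r, metric=None):
    """Merge up to r tokens into their most similar partner (illustrative).

    tokens: list of feature vectors (lists of floats).
    metric: optional vectors used only to score similarity;
            defaults to the tokens themselves (assumption).
    """
    metric = tokens if metric is None else metric
    n = len(tokens)
    # Assumption: split tokens into two alternating sets, A and B.
    a_idx = list(range(0, n, 2))
    b_idx = list(range(1, n, 2))
    r = min(r, len(a_idx)) if b_idx else 0
    if r <= 0:
        return [list(t) for t in tokens]

    # For every token in A, find its most similar token in B.
    candidates = []
    for i in a_idx:
        score, j = max((cosine(metric[i], metric[j]), j) for j in b_idx)
        candidates.append((score, i, j))

    # Keep only the r most similar pairs; those A tokens get merged away.
    candidates.sort(reverse=True)
    target = {i: j for _, i, j in candidates[:r]}

    # Average each B token with the A tokens merged into it.
    sums = {j: list(tokens[j]) for j in b_idx}
    sizes = {j: 1 for j in b_idx}
    for i, j in target.items():
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        sizes[j] += 1

    merged = []
    for k in range(n):
        if k in target:
            continue  # this token was merged into a B token
        if k in sums:
            merged.append([s / sizes[k] for s in sums[k]])
        else:
            merged.append(list(tokens[k]))
    return merged


tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
merged = merge_tokens(tokens, r=2)
print(len(tokens), "->", len(merged))
```

Each call removes at most `r` tokens, so applying it once per layer reduces the token count gradually rather than all at once. Passing a separate `metric` list lets similarity be scored on different vectors than the ones being averaged.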
Experiments and Results
The authors conducted several experiments on ImageNet-1k using ViT models trained with AugReg, MAE, SWAG, and DeiT, and compared off-the-shelf performance with and without ToMe applied. They observed consistent improvements in throughput with minimal loss of accuracy. For design choices, their ablation studies found that merging after attention, using attention keys (K) instead of token features (X), significantly improved accuracy. Experiments with different merge schedules showed that merging gradually over multiple layers gave the best results. Notably, ToMe merged object parts into one token even across multiple frames of video, which suggests potential for real-world applications such as action recognition or tracking objects over time.
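The effect of a gradual merge schedule on token count can be illustrated with simple arithmetic. The sketch below assumes a constant schedule that removes a fixed number of tokens, `r`, in every layer. The function name `token_counts` and the example values (197 tokens, 12 layers, `r = 8`) are illustrative assumptions, not settings reported in this summary.

```python
def token_counts(num_tokens, num_layers, r):
    """Tokens remaining after each layer when r tokens are merged per layer.

    Assumes a constant schedule (hypothetical); at least one token is kept.
    """
    counts = []
    remaining = num_tokens
    for _ in range(num_layers):
        remaining = max(1, remaining - r)
        counts.append(remaining)
    return counts


# Illustrative values only: 197 input tokens, 12 layers, 8 merged per layer.
counts = token_counts(197, 12, 8)
print(counts)
```

With these example values the model ends with 101 of its original 197 tokens, and the reduction is spread evenly across layers instead of being applied in a single step.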
Conclusion
Overall, Token Merging is an efficient way to increase the throughput of ViT models without additional training, while remaining competitive in accuracy and speed. The authors provide code for ToMe in their GitHub repository, which makes the method easy to implement and explore further, and an attractive option for improving Vision Transformer performance.