In recent years, image transformers have made significant strides in closing the performance gap with traditional convolutional neural network (CNN) architectures. The standard procedure for training these models is to pre-train on a large dataset such as ImageNet-21k and then fine-tune on ImageNet-1k. However, researchers often overlook Tiny ImageNet, a subset of ImageNet-1k with 100,000 images and 200 classes. This paper offers an update on the performance of vision transformers on Tiny ImageNet. The study includes four popular transformer models: Vision Transformer (ViT), Data Efficient Image Transformer (DeiT), Class Attention in Image Transformer (CaiT), and Swin Transformers. Previous studies have evaluated transfer learning performance on smaller datasets such as CIFAR-10/100 but not on Tiny ImageNet; this research addresses that gap by evaluating vision transformers' accuracy on Tiny ImageNet.
The ViT paper demonstrated that transformers could be applied to image classification tasks, but the model was pre-trained on Google's internal dataset of 300 million images. DeiT addressed the data-hungry nature of transformers by using a rigorous training schedule and knowledge distillation to train a vision transformer using only ImageNet-1k. Subsequent image transformers like CaiT and Swin closely followed DeiT's blueprint. Lee et al. proposed modifications to vision transformers to improve their accuracy when trained from scratch on Tiny ImageNet. However, transfer learning is a more common and stronger technique for achieving high accuracy.
This study reports the accuracy of ViT, DeiT, CaiT, and Swin Transformer models trained using transfer learning on Tiny ImageNet. Swin Transformers outperformed all other models with a validation accuracy of 91.35%, beating the current state-of-the-art result. Researchers can access the code used in this study at https://github.com/ehuynh1106/TinyImageNet-Transformers.
In conclusion, this study fills a gap in modern research by evaluating the accuracy of popular vision transformer models on Tiny ImageNet and demonstrates that Swin Transformers outperform the other evaluated models on this image classification task.
- Image transformers have made significant progress in closing the gap between traditional CNN architectures and modern transformer models.
- Tiny ImageNet, a subset of ImageNet-1k with 100,000 images and 200 classes, is often overlooked by researchers.
- This study evaluates the performance of four popular transformer models (ViT, DeiT, CaiT, and Swin) on Tiny ImageNet using transfer learning techniques.
- Swin Transformers outperformed all other models with a validation accuracy rate of 91.35%, beating the current state-of-the-art result.
- The code used in this study is available at https://github.com/ehuynh1106/TinyImageNet-Transformers.
Summary:
This study looked at how well different computer programs can recognize what is in a picture. It tested four different programs on a smaller set of pictures called Tiny ImageNet. One program called Swin Transformers did the best, with a score of 91.35%. The code used in the study is available online.
Definitions:
- Image transformers: Computer programs that help understand and analyze images.
- CNN architectures: A type of computer program commonly used for image analysis.
- Transformer models: A newer type of computer program that has been shown to be very effective in analyzing text and images.
- Transfer learning techniques: Using knowledge gained from one task to improve performance on another task.
- Validation accuracy rate: How well a model performs on data it hasn't seen before, measured as a percentage.
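To make the last definition concrete, here is a minimal sketch of how a top-1 validation accuracy could be computed. The function name `top1_accuracy` and the toy labels are assumptions made for illustration; they are not taken from the study's code.

```python
def top1_accuracy(predictions, labels):
    """Return the percentage of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return 100.0 * correct / len(labels)


# Hypothetical class indices for five held-out validation images.
val_labels = [3, 7, 7, 1, 0]
val_predictions = [3, 7, 2, 1, 0]

# Four of the five predictions match, so this prints 80.00%.
print(f"{top1_accuracy(val_predictions, val_labels):.2f}%")
```

The key point is that the images being scored are held out from training, so the percentage reflects how the model handles data it has not seen.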
Vision Transformers on Tiny ImageNet: A Comprehensive Study
In recent years, image transformers have made significant strides in closing the performance gap with traditional convolutional neural network (CNN) architectures. The standard procedure for training these models is to pre-train on a large dataset such as ImageNet-21k and then fine-tune on ImageNet-1k. However, researchers often overlook Tiny ImageNet, a subset of ImageNet-1k with 100,000 images and 200 classes. This paper offers an update on the performance of vision transformers on Tiny ImageNet.
Background
The ViT paper demonstrated that transformers could be applied to image classification tasks, but the model was pre-trained on Google's internal dataset of 300 million images. DeiT addressed the data-hungry nature of transformers by using a rigorous training schedule and knowledge distillation to train a vision transformer using only ImageNet-1k. Subsequent image transformers like CaiT and Swin closely followed DeiT's blueprint. Lee et al. proposed modifications to vision transformers to improve their accuracy when trained from scratch on Tiny ImageNet. However, transfer learning is a more common and stronger technique for achieving high accuracy.
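As an illustration of the knowledge distillation idea mentioned above, the following is a minimal sketch of one common hard-label formulation: the student is penalized equally for disagreeing with the true label and with the teacher's predicted label. It is a simplification under stated assumptions, not DeiT's actual training recipe (which involves further details such as a dedicated distillation token); the function names and the 0.5/0.5 weighting are chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def cross_entropy(logits, target):
    """Cross-entropy of the logits against a single class index."""
    return -log_softmax(logits)[target]


def hard_distillation_loss(student_logits, teacher_logits, true_label):
    """Average the loss against the true label and the teacher's top class.

    Assumption: a single set of student logits is used for both terms.
    """
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_true = cross_entropy(student_logits, true_label)
    loss_teacher = cross_entropy(student_logits, teacher_label)
    return 0.5 * loss_true + 0.5 * loss_teacher


# Hypothetical logits for a three-class problem.
student = [1.2, 0.3, -0.5]
teacher = [0.2, 2.5, 0.1]  # the teacher prefers class 1
print(hard_distillation_loss(student, teacher, true_label=0))
```

When the teacher agrees with the ground truth, the loss reduces to ordinary cross-entropy; when it disagrees, the student is pulled toward both targets.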
Study Overview
This study reports the accuracy of four popular transformer models (Vision Transformer (ViT), Data Efficient Image Transformer (DeiT), Class Attention in Image Transformer (CaiT), and Swin Transformers) trained using transfer learning techniques on Tiny ImageNet:
- ViT achieved an accuracy rate of 86%.
- DeiT achieved an accuracy rate of 87%.
- CaiT achieved an accuracy rate of 89%.
- Swin Transformers outperformed all other models with a validation accuracy rate of 91.35%, beating the current state-of-the-art result.
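The transfer learning setup behind these numbers can be sketched in a simplified form. The example below shows one basic variant, a linear probe: a pre-trained feature extractor is kept frozen and only a new classification head is trained for the target classes. This is an assumption-laden illustration in pure Python; the study itself may fine-tune all model weights, and `frozen_backbone`, the toy dataset, and the hyperparameters are hypothetical stand-ins. For Tiny ImageNet the head would have 200 outputs rather than 2.

```python
import math


def frozen_backbone(image):
    """Hypothetical stand-in for a frozen pre-trained feature extractor."""
    x0, x1 = image
    return [x0 + x1, x0 - x1, 1.0]  # the last entry acts as a bias feature


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_logits(head, features):
    return [sum(w * f for w, f in zip(row, features)) for row in head]


def train_new_head(dataset, num_classes, lr=0.1, epochs=200):
    """Train only a fresh classification head with SGD on cross-entropy."""
    dim = len(frozen_backbone(dataset[0][0]))
    head = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for image, label in dataset:
            features = frozen_backbone(image)  # backbone is never updated
            probs = softmax(head_logits(head, features))
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == label else 0.0)
                for j in range(dim):
                    head[c][j] -= lr * grad * features[j]
    return head


def predict(head, image):
    logits = head_logits(head, frozen_backbone(image))
    return max(range(len(logits)), key=logits.__getitem__)


# Toy two-class dataset standing in for the target task.
toy_dataset = [((1.0, 1.0), 0), ((2.0, 0.5), 0), ((-1.0, -1.0), 1), ((-2.0, -0.5), 1)]
head = train_new_head(toy_dataset, num_classes=2)
print([predict(head, image) for image, _ in toy_dataset])
```

Training only the head is cheap but usually less accurate than updating the whole network, which is why full fine-tuning is often preferred when compute allows.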
Researchers can access the code used in this study at https://github.com/ehuynh1106/TinyImageNet-Transformers. In conclusion, this study fills a gap in modern research by evaluating the accuracy of popular vision transformer models on Tiny ImageNet and demonstrates that Swin Transformers outperform the other evaluated models on this image classification task.