Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

AI-generated keywords: Vision transformers, Attention layers, Image classification, Feed-forward layers, Linear complexity

AI-generated Key Points

- The study questions whether attention layers are necessary in vision transformers.
- The experiment replaces each attention layer with a feed-forward layer applied over the patch dimension.
- The modified architecture reaches 74.9% top-1 accuracy on ImageNet, compared to 77.9% for ViT and 79.9% for DeiT.
- The results suggest that other factors, such as patch embeddings and training procedures, may matter more than previously thought.
- A practical advantage of the feed-forward-only model is linear complexity with respect to sequence length.
- The findings raise questions about the importance of attention layers in transformer architectures.


Authors: Luke Melas-Kyriazi

arXiv: 2105.02723v1 - DOI (cs.CV)

Short Technical Report. GitHub: https://github.com/lukemelas/do-you-even-need-attention

License: CC BY 4.0

Abstract: The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought. We hope these results prompt the community to spend more time trying to understand why our current models are as effective as they are.
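As a rough illustration of the architecture the abstract describes, the following NumPy sketch applies one feed-forward layer over the feature dimension and one over the patch dimension, using a transpose to switch axes. It is a minimal sketch under stated assumptions, not the author's implementation: the residual connections, the layer normalization without learned scale, the GELU activation, the 4x hidden expansion, and all names and sizes (`feed_forward_block`, `init_mlp`, 196 patches, 64 features) are chosen here for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last axis (no learned scale or shift, for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_mlp(rng, dim, hidden):
    # Two-layer MLP parameters mapping dim -> hidden -> dim.
    return {
        "w1": rng.normal(0.0, 0.02, (dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, (hidden, dim)), "b2": np.zeros(dim),
    }

def mlp(p, x):
    return gelu(x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def feed_forward_block(params, x):
    # x has shape (num_patches, num_features).
    # 1) Feed-forward over the feature dimension (as in a standard transformer).
    x = x + mlp(params["feature_mlp"], layer_norm(x))
    # 2) Feed-forward over the patch dimension, in place of attention.
    xt = x.T  # (num_features, num_patches)
    xt = xt + mlp(params["patch_mlp"], layer_norm(xt))
    return xt.T  # back to (num_patches, num_features)

rng = np.random.default_rng(0)
num_patches, num_features = 196, 64  # illustrative sizes (assumption)
params = {
    "feature_mlp": init_mlp(rng, num_features, 4 * num_features),
    "patch_mlp": init_mlp(rng, num_patches, 4 * num_patches),
}
x = rng.normal(size=(num_patches, num_features))
y = feed_forward_block(params, x)
print(y.shape)
```

Stacking several such blocks gives the alternating pattern the abstract refers to. Note that the patch-axis MLP has weights whose shape depends on `num_patches`, so the block only accepts inputs with that exact number of patches.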

Submitted to arXiv on 06 May 2021

Results of the summarizing process for the arXiv paper: 2105.02723v1

In the study "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet," the author examines whether attention layers are necessary in vision transformers. Vision transformers have gained popularity for their strong performance in image classification, which is often attributed to their multi-head attention layers. However, the extent to which attention contributes to this performance remains unclear.

To test this, the author replaces the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The result is an architecture consisting of a series of feed-forward layers applied alternately over the patch and feature dimensions. This modified architecture performs surprisingly well on ImageNet: a ViT/DeiT-base-sized model achieves 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT, respectively. These findings suggest that aspects other than attention, such as patch embeddings and training procedures, may play a more significant role in the success of vision transformers than previously thought.

The study also highlights a practical advantage of the feed-forward-only model: its complexity is linear with respect to sequence length, compared to quadratic for attention-based vision transformers. Its main limitation is that it works only on fixed-length sequences. Even so, its performance raises questions about the importance of attention layers in transformer architectures, and the results call for further work on understanding why current models are effective and which components drive their success in vision tasks.
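The linear-versus-quadratic claim can be made concrete with a back-of-the-envelope count of multiply-adds for the token-mixing step alone. The sketch below rests on an assumption not stated in the summary: the hidden width of the patch-mixing layer is held fixed instead of being scaled with the number of patches (if it is set to a multiple of the patch count, the cost becomes quadratic as well). The function names and the example sizes are hypothetical, and the counts ignore projections, biases, and softmax.

```python
def attention_mixing_macs(n, d):
    """Multiply-adds for QK^T plus the attention-weighted sum of V.

    n: number of patches (sequence length), d: feature dimension.
    Query/key/value/output projections are ignored.
    """
    return 2 * n * n * d

def patch_mlp_macs(n, d, hidden):
    """Multiply-adds for a two-layer MLP over the patch axis.

    The MLP maps n -> hidden -> n and is applied once per feature channel.
    """
    return 2 * n * hidden * d

d, hidden = 768, 784  # illustrative sizes (assumption)
for n in (196, 392, 784):
    print(n, attention_mixing_macs(n, d), patch_mlp_macs(n, d, hidden))
```

Doubling `n` quadruples the attention count but only doubles the patch-MLP count, which is the sense in which the feed-forward-only model scales linearly under this assumption.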

Summary
- A study looked at whether attention layers are needed in vision transformers.
- The author replaced each attention layer with a feed-forward layer applied over the patches.
- Even with this change, the model still performed well on ImageNet (74.9% top-1 accuracy, compared to 77.9% for ViT and 79.9% for DeiT).
- The study suggests that things like patch embeddings and training methods might be more important than previously thought.
- Using only feed-forward layers makes the model simpler, and its cost can grow linearly rather than quadratically with the number of patches.

Definitions
- Attention layers: Neural network components that let each part of the input weigh information from every other part.
- Vision transformers: Transformer-based models that process an image as a sequence of patches.
- Feed-forward layer: A type of neural network component that processes input data in one direction without feedback loops.
- Patch dimension: The axis of the model's input that indexes the image patches (small sections of the image), as opposed to the feature dimension, which indexes the values describing each patch.

Attention layers are a fundamental component of transformer architectures and play a crucial role in natural language processing (NLP) tasks. Whether they are equally necessary in vision transformers is less clear. In the study "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet," the author examines this question by measuring how well a feed-forward-only model performs on image classification.

Vision transformers have become popular because of their strong results on visual recognition benchmarks, which are often attributed to their multi-head attention layers. These layers can capture long-range dependencies and extract relevant features from images. However, it remains unclear how much attention actually contributes to the success of these models.

To investigate, the author replaces the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The result is an architecture consisting of multiple feed-forward layers applied alternately over the patch and feature dimensions. This modified architecture performs surprisingly well on ImageNet: a ViT/DeiT-base-sized model achieves 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT, respectively.

These findings suggest that aspects other than attention may play a more significant role in the success of vision transformers. One such aspect is the patch embedding: the representation that encodes local image information into tokens that the transformer can process. The study suggests that patch embeddings may be more responsible for strong performance than previously thought. Training procedures also seem to contribute significantly to the performance of vision transformers, a point that comparisons of pre-training methods (supervised contrastive learning, and self-supervised approaches such as SimCLR and MoCo) have also raised.

Beyond questioning the importance of attention layers, this research points to a practical advantage of the feed-forward-only model: its complexity is linear with respect to sequence length, compared to quadratic for attention-based vision transformers, which could make it more efficient for longer sequences. Its main limitation is that it works only on fixed-length sequences. This raises questions about its applicability to tasks where input sizes vary, such as object detection and segmentation.

In conclusion, "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet" challenges existing assumptions about the key components driving the success of vision transformers. The results call for further work on understanding why current models are effective and for more comprehensive studies of their underlying mechanisms.
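The patch embedding and the fixed-length limitation discussed above can be illustrated together. The NumPy sketch below splits an image into non-overlapping patches, projects each patch to a token, and then shows that a patch-axis weight matrix sized for one image resolution does not fit another. The 16x16 patch size, the 224x224 and 256x256 image sizes, the 768-dimensional embedding, and the names `patchify`, `w_embed`, and `w_patch` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def patchify(image, patch):
    # Split an (H, W, C) image into flattened non-overlapping patches.
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
patch, embed_dim = 16, 768  # illustrative sizes (assumption)

# Patch embedding: one shared linear projection applied to every patch.
w_embed = rng.normal(0.0, 0.02, (patch * patch * 3, embed_dim))
tokens = patchify(rng.normal(size=(224, 224, 3)), patch) @ w_embed

# A patch-axis linear layer has weights tied to the number of patches.
w_patch = rng.normal(0.0, 0.02, (tokens.shape[0], tokens.shape[0]))
mixed = tokens.T @ w_patch  # mixes information across patches

# A larger image yields more patches, so the same weights no longer fit.
bigger = patchify(rng.normal(size=(256, 256, 3)), patch) @ w_embed
try:
    bigger.T @ w_patch
    fits = True
except ValueError:
    fits = False
print(tokens.shape, bigger.shape, fits)
```

The patch embedding itself handles any image whose sides are multiples of the patch size; only the layer applied over the patch axis is bound to a specific sequence length.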

Created on 05 Jul. 2026

