Hyena Hierarchy: Towards Larger Convolutional Language Models

AI-generated keywords: Hyena, Attention, NLP, Transformers, Efficiency

AI-generated Key Points

⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

Deep learning has made significant strides in NLP tasks, with large Transformers being a popular choice.
The attention operator, a core building block of Transformers, exhibits quadratic cost in sequence length, which limits the amount of context that can be accessed.
Existing subquadratic methods based on low-rank and sparse approximations have been developed but still need to be combined with dense attention layers to match the performance of Transformers.
Michael Poli and co-authors propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models.
The team also set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K.
Hyena operators are twice as fast as highly optimized attention at sequence length 8K and 100x faster at sequence length 64K.
Overall, Hyena offers an efficient alternative to traditional attention mechanisms used in NLP tasks. Its success highlights the potential for further research into developing more efficient deep learning models that can handle larger sequences while maintaining high levels of accuracy.
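
To make the quadratic-cost point concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is a generic textbook formulation, not code from the paper; the function name `dense_attention` and the toy sizes are assumptions chosen for illustration. The intermediate score matrix has one entry per pair of positions, so its size grows with the square of the sequence length.

```python
import numpy as np

def dense_attention(q, k, v):
    """Single-head scaled dot-product attention for a length-N sequence.

    q, k, v have shape (N, d). Returns the output (N, d) and the shape of
    the score matrix, which is the source of the quadratic cost.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # shape (N, N): one entry per token pair
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ v, scores.shape

rng = np.random.default_rng(0)
for n in (128, 256, 512):  # toy lengths, far below the 2K-64K range discussed above
    q, k, v = (rng.standard_normal((n, 16)) for _ in range(3))
    out, score_shape = dense_attention(q, k, v)
    # Doubling n quadruples the number of score entries.
    print(n, out.shape, score_shape[0] * score_shape[1])
```

Each doubling of the sequence length quadruples the number of score entries, which is why long contexts become expensive for dense attention.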

Authors: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

arXiv: 2302.10866v1 (cs.LG)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.

Submitted to arXiv on 21 Feb. 2023

Results of the summarizing process for the arXiv paper: 2302.10866v1

Comprehensive Summary

In recent years, deep learning has made significant strides in natural language processing (NLP) tasks, with large Transformers being a popular choice due to their ability to learn at scale. However, the attention operator, a core building block of Transformers, exhibits quadratic cost in sequence length, which limits the amount of context that can be accessed. Existing subquadratic methods based on low-rank and sparse approximations still need to be combined with dense attention layers to match the performance of Transformers, which indicates a gap in capability.

To address this gap, Michael Poli and co-authors propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models.

The authors also set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Furthermore, Hyena operators are twice as fast as highly optimized attention at sequence length 8K and 100x faster at sequence length 64K.

Overall, Hyena offers an efficient alternative to traditional attention mechanisms used in NLP tasks. Its success highlights the potential for further research into more efficient deep learning models that can handle longer sequences while maintaining high levels of accuracy.

Layman's Summary

1. Deep learning is a kind of machine learning that, among other things, helps computers work with language.
2. Transformers are a popular deep learning tool for language tasks, but they are limited in how much text they can process at once.
3. Researchers have developed a new component called Hyena that can take the place of the attention part of a Transformer, so that models can process much longer inputs.
4. On long inputs, Hyena is faster than the traditional attention mechanism while matching the accuracy of attention-based models.
5. This work shows promise for future research into even more efficient deep learning models.

Definitions

- Deep learning: A kind of machine learning based on large neural networks; here it is applied to language.
- Transformers: A popular deep learning tool for language tasks.
- Attention operator: A core building block of Transformers that lets each part of the input draw on information from the other parts.
- Quadratic cost: Processing time and resources that grow with the square of the input length, so doubling the input roughly quadruples the work.
- Subquadratic methods: Methods whose cost grows more slowly than the square of the input length.
- Michael Poli: The first-listed author of the Hyena paper.

Introducing Hyena: A Subquadratic Drop-in Replacement for Attention in Natural Language Processing

What is Hyena?

Hyena is an efficient alternative to the attention mechanism used in NLP models. It is a subquadratic drop-in replacement for attention, constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. The abstract compares it against operators that rely on state-spaces and against other implicit and explicit methods.
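
The construction described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the page only has access to the paper metadata, so the filter parametrization (`implicit_filter`, a decaying sinusoid with a few parameters per channel), the projection matrices, and the function names are all assumptions. The sketch only shows the general pattern of alternating a long causal convolution, computed with an FFT, with an elementwise gate derived from the input.

```python
import numpy as np

def causal_fft_conv(u, h):
    """Causal convolution of u (N, D) with per-channel filters h (N, D) via FFT."""
    n = u.shape[0]
    size = 2 * n  # zero-pad so the circular FFT product equals a linear convolution
    u_f = np.fft.rfft(u, n=size, axis=0)
    h_f = np.fft.rfft(h, n=size, axis=0)
    return np.fft.irfft(u_f * h_f, n=size, axis=0)[:n]

def implicit_filter(n, d, rng):
    """Hypothetical implicit parametrization: the filter is a function of position,
    so its parameter count (2 per channel here) does not grow with n."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    freqs = rng.uniform(1.0, 8.0, size=(1, d))
    decay = rng.uniform(1.0, 5.0, size=(1, d))
    return np.sin(2.0 * np.pi * freqs * t) * np.exp(-decay * t)

def hyena_like_operator(u, w_value, w_gates, filters):
    """Alternate long convolutions with data-controlled (input-dependent) gating."""
    z = u @ w_value
    for w_gate, h in zip(w_gates, filters):
        gate = u @ w_gate                    # gate values depend on the input itself
        z = gate * causal_fft_conv(z, h)     # long convolution, then elementwise gating
    return z

rng = np.random.default_rng(0)
n, d, order = 1024, 8, 2  # toy sizes chosen for illustration
u = rng.standard_normal((n, d))
w_value = rng.standard_normal((d, d)) / np.sqrt(d)
w_gates = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(order)]
filters = [implicit_filter(n, d, rng) for _ in range(order)]
y = hyena_like_operator(u, w_value, w_gates, filters)
print(y.shape)
```

No N-by-N matrix is ever formed in this sketch: each step costs on the order of N log N per channel because of the FFT, which is one way an operator of this kind can be subquadratic in sequence length.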

Performance Improvements

In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. The authors also set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Furthermore, Hyena operators are twice as fast as highly optimized attention at sequence length 8K and 100x faster at sequence length 64K.
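
As a rough back-of-the-envelope illustration of why the speed advantage widens with sequence length, the snippet below compares the growth of an N² term (dense attention) against an N log₂ N term. The N log N figure is an assumption: it is the standard cost of an FFT-based convolution, and the page itself only states that Hyena is subquadratic. The ratios ignore constant factors, memory traffic and hardware effects, so they are not expected to reproduce the measured 2x and 100x speedups quoted above; the function name `relative_cost` is made up for this sketch.

```python
import math

def relative_cost(n):
    """Ratio of an N^2 term to an N*log2(N) term, ignoring all constant factors."""
    return (n * n) / (n * math.log2(n))

for n in (2048, 8192, 65536):  # the 2K, 8K and 64K lengths mentioned above
    print(f"N={n:>6}: N^2 / (N log2 N) = {relative_cost(n):,.0f}")
```

The ratio keeps growing as N increases, which matches the direction of the reported trend: the longer the sequence, the larger the gap between quadratic and subquadratic operators.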

Conclusion

Overall, Hyena offers an efficient alternative to the attention mechanism used in NLP models. According to the abstract, it matches attention-based models on recall and reasoning tasks over sequences of thousands to hundreds of thousands of tokens, does so without dense attention layers, and reaches Transformer quality on language modeling with 20% less training compute at sequence length 2K. Its success highlights the potential for further research into more efficient deep learning models that can handle longer sequences while maintaining high levels of accuracy.

Created on 23 Apr. 2023
