Hungry Hungry Hippos: Towards Language Modeling with State Space Models

AI-generated keywords: State Space Models

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • State space models (SSMs) underperform attention-based models like Transformers in language modeling tasks
  • Despite scaling nearly linearly in sequence length rather than quadratically, SSMs are still slower than Transformers because of poor hardware utilization
  • The authors propose a new SSM layer, H3, designed to recall earlier tokens and compare tokens across the sequence; H3 matches attention on synthetic languages and comes within 0.4 perplexity (PPL) of Transformers on OpenWebText
  • A hybrid 125M-parameter H3-attention model that retains two attention layers outperforms Transformers on OpenWebText by 1.0 PPL
  • FlashConv improves the efficiency of training SSMs on modern hardware, yielding a 2x speedup on the long-range arena benchmark and letting hybrid language models generate text 2.4x faster than Transformers
  • Hybrid H3-attention language models scaled up to 2.7B parameters on the Pile achieve lower perplexity than Transformers and outperform them in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark
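
The PPL figures in the key points are perplexity values, so a gap of "1.0 PPL" is a difference between two perplexities. As a minimal sketch of how perplexity is usually computed from per-token log-probabilities (the function name `perplexity` and the toy inputs are illustrative assumptions, not taken from the paper):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Toy input (assumed): a model that assigns probability 1/4 to every token.
# Its perplexity is 4, up to floating-point rounding.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

Lower is better: a model with lower perplexity assigns higher average probability to the held-out text.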

Authors: Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

ICLR 2023 Camera-Ready (Notable-top-25% / Spotlight)

Abstract: State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.
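
The abstract's point about near-linear scaling rests on a standard property of linear SSMs: the step-by-step recurrence can be rewritten as a causal convolution, which an FFT evaluates in O(L log L) time for a sequence of length L. The sketch below illustrates that equivalence with NumPy on a small single-input, single-output SSM. It is a generic illustration under assumed names (`ssm_recurrent`, `ssm_kernel`, `ssm_fft_conv`) and an assumed recurrence convention, not the fused block FFT kernel that FlashConv implements.

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    """Reference: x_k = A x_{k-1} + B u_k, y_k = C x_k, starting from x = 0."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

def ssm_kernel(A, B, C, length):
    """Convolution kernel K_j = C A^j B for j = 0 .. length-1."""
    K = np.empty(length)
    v = B.copy()
    for j in range(length):
        K[j] = C @ v
        v = A @ v
    return K

def ssm_fft_conv(A, B, C, u):
    """Same outputs as ssm_recurrent, computed as a causal convolution via FFT."""
    L = len(u)
    K = ssm_kernel(A, B, C, L)
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)
    return y[:L]

# Toy check with assumed random parameters (stable diagonal A).
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 0.9, 4))
B = rng.standard_normal(4)
C = rng.standard_normal(4)
u = rng.standard_normal(64)
print(np.allclose(ssm_recurrent(A, B, C, u), ssm_fft_conv(A, B, C, u)))
```

The kernel here is built naively for clarity; practical SSM layers use structured state matrices so the kernel itself is cheap to compute.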

Submitted to arXiv on 28 Dec. 2022


Results of the summarizing process for the arXiv paper: 2212.14052v3


The paper "Hungry Hungry Hippos: Towards Language Modeling with State Space Models" examines why state space models (SSMs) lag behind attention-based models such as Transformers in language modeling. SSMs have demonstrated state-of-the-art sequence modeling performance in some modalities, but they underperform attention on language. In addition, although they scale nearly linearly in sequence length rather than quadratically, SSMs are still slower than Transformers because of poor hardware utilization.

To understand the expressivity gap, the authors use synthetic language modeling tasks. They find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. They propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 perplexity (PPL) of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers outperforms Transformers on OpenWebText by 1.0 PPL.

To improve the efficiency of training SSMs on modern hardware, the authors propose FlashConv. It uses a fused block Fast Fourier Transform (FFT) algorithm to improve efficiency on sequences up to 8K, and it introduces a state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields a 2x speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4x faster than Transformers.

Using FlashConv, the authors scale hybrid H3-attention language models up to 2.7B parameters on the Pile and report promising initial results. These models achieve lower perplexity than Transformers and outperform Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark. Overall, the paper narrows both the expressivity gap and the hardware-efficiency gap between SSMs and attention in language modeling.
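
The idea behind state passing can be sketched in a few lines: split a long sequence into chunks, run an FFT convolution inside each chunk, and carry the SSM's recurrent state from one chunk to the next so the result matches processing the whole sequence at once. The code below is a hedged NumPy illustration of that general idea only. The function names (`chunk_outputs`, `ssm_state_passing`), the recurrence convention, and the naive matrix powers are assumptions made for readability, not the paper's algorithm or kernel.

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    """Reference: x_k = A x_{k-1} + B u_k, y_k = C x_k, starting from x = 0."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

def chunk_outputs(A, B, C, u, x_prev):
    """Outputs and final state for one chunk, given the state carried in."""
    L = len(u)
    # Powers A^0 .. A^L (computed naively here, for clarity only).
    powers = [np.eye(A.shape[0])]
    for _ in range(L):
        powers.append(A @ powers[-1])
    # Within-chunk causal convolution with kernel K_j = C A^j B, via FFT.
    K = np.array([C @ P @ B for P in powers[:L]])
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)[:L]
    # Contribution of the carried-in state: local step k sees A^(k+1) x_prev.
    y = y + np.array([C @ powers[k + 1] @ x_prev for k in range(L)])
    # State at the end of the chunk: A^L x_prev + sum_j A^(L-1-j) B u_j.
    x_next = powers[L] @ x_prev
    for j in range(L):
        x_next = x_next + (powers[L - 1 - j] @ B) * u[j]
    return y, x_next

def ssm_state_passing(A, B, C, u, chunk_len):
    """Process u chunk by chunk, passing the recurrent state between chunks."""
    x = np.zeros(A.shape[0])
    out = []
    for start in range(0, len(u), chunk_len):
        y, x = chunk_outputs(A, B, C, u[start:start + chunk_len], x)
        out.append(y)
    return np.concatenate(out)

# Toy check with assumed random parameters; the last chunk is shorter than the rest.
rng = np.random.default_rng(1)
A = np.diag(rng.uniform(0.1, 0.9, 4))
B = rng.standard_normal(4)
C = rng.standard_normal(4)
u = rng.standard_normal(50)
print(np.allclose(ssm_recurrent(A, B, C, u), ssm_state_passing(A, B, C, u, 16)))
```

Because each chunk only needs the previous chunk's final state, the FFT size stays bounded by the chunk length regardless of how long the full sequence is.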
Created on 11 Dec. 2023
