xLSTM: Extended Long Short-Term Memory

AI-generated keywords: LSTM, Transformer, xLSTM, language modeling, scalability

AI-generated Key Points

  • Long Short-Term Memory (LSTM) models introduced in the 1990s with concepts of constant error carousel and gating
  • LSTM instrumental in deep learning success stories, particularly in Large Language Models (LLMs)
  • Emergence of Transformer technology with parallelizable self-attention capabilities surpassing LSTMs in scalability
  • Introduction of Extended Long Short-Term Memory (xLSTM) to test how far LSTMs get when scaled to billions of parameters with modern LLM techniques while mitigating their known limitations
  • xLSTM incorporates exponential gating with normalization and stabilization techniques, sLSTM with scalar memory, and mLSTM fully parallelizable with matrix memory
  • Performance evaluation of xLSTM through experiments on formal languages, Multi-Query Associative Recall tasks, the Long Range Arena, and training on 15B tokens from the SlimPajama dataset
  • Ablation studies on xLSTM for better understanding of performance
  • Comparison against other top-performing methods using 300B tokens from SlimPajama dataset for training in comprehensive language modeling experiment
  • Evaluation criteria include extrapolation to longer contexts, validation perplexity scores, performance on downstream tasks, assessment across multiple text domains using PALOMA benchmark dataset, and scaling behavior analysis

Authors: Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter

License: CC BY 4.0

Abstract: In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
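
To make "exponential gating with appropriate normalization and stabilization" concrete, here is a minimal sketch of one scalar-memory step in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `slstm_step`, its argument names, the use of exponential activations for both the input and forget gate, and the choice of starting the stabilizer at negative infinity are all choices made here for the example. The idea shown is that the gates are handled in log space with a running maximum `m`, and a normalizer state `n` is carried alongside the cell state `c`, so that the output `c / n` stays well scaled even when the gate pre-activations are large.

```python
import math

def slstm_step(c, n, m, z, i_pre, f_pre, o):
    """One stabilized scalar step with exponential input and forget gates.

    c, n, m : previous cell state, normalizer state, log-domain stabilizer
    z       : candidate cell input
    i_pre   : input-gate pre-activation (gate would be exp(i_pre))
    f_pre   : forget-gate pre-activation (gate would be exp(f_pre))
    o       : output gate value, assumed to lie in (0, 1]
    """
    # Running maximum in log space; subtracting it keeps both exponents <= 0.
    m_new = max(f_pre + m, i_pre)
    i_gate = math.exp(i_pre - m_new)
    f_gate = math.exp(f_pre + m - m_new)
    # Cell and normalizer are scaled by the same factor, so c / n is unchanged
    # by the stabilization.
    c_new = f_gate * c + i_gate * z
    n_new = f_gate * n + i_gate
    h = o * c_new / n_new
    return c_new, n_new, m_new, h

# Hypothetical usage: pre-activations of 1000 would overflow a naive exp().
c, n, m = 0.0, 0.0, float("-inf")
c, n, m, h = slstm_step(c, n, m, z=2.0, i_pre=1000.0, f_pre=1000.0, o=1.0)
c, n, m, h = slstm_step(c, n, m, z=4.0, i_pre=1000.0, f_pre=0.0, o=1.0)
print(h)  # 3.0, the equally weighted average of the two inputs
```

Calling `math.exp(1000.0)` directly raises `OverflowError`, which is why the sketch never materializes the unstabilized gate values.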

Submitted to arXiv on 07 May. 2024


Results of the summarizing process for the arXiv paper: 2405.04517v1

In the 1990s, Long Short-Term Memory (LSTM) models were introduced with the concepts of the constant error carousel and gating. LSTMs have since been instrumental in various deep learning success stories, particularly in the development of the first Large Language Models (LLMs). However, the emergence of the Transformer, with parallelizable self-attention at its core, ushered in a new era in which LSTMs were outpaced at scale. This raises a fundamental question: how effective can LSTMs be in language modeling when scaled to billions of parameters while incorporating modern techniques from LLMs and mitigating their known limitations?

To address this question, the authors propose Extended Long Short-Term Memory (xLSTM). The xLSTM model incorporates exponential gating with normalization and stabilization techniques, as well as modifications to the LSTM memory structure. These modifications yield two variants: sLSTM, with a scalar memory, a scalar update, and new memory mixing, and mLSTM, which is fully parallelizable and uses a matrix memory with a covariance update rule.

The experiments, detailed in Section 4 of the paper, evaluate xLSTM and compare it against existing methods in language modeling. Synthetic tasks test xLSTM on formal languages and Multi-Query Associative Recall, and the Long Range Arena assesses its ability to process long sequences. Further experiments train on 15B tokens from the SlimPajama dataset, with ablation studies to identify which components drive performance. Scaling behavior is compared in the manner of Kaplan et al. (2020) and Brown et al. (2020). In a more comprehensive language modeling experiment, xLSTM is trained on 300B tokens from SlimPajama and compared against other top-performing methods. Evaluation criteria include extrapolation to longer contexts, validation perplexity, performance on downstream tasks (Sutawika et al., 2024), assessment across multiple text domains using the PALOMA benchmark (Magnusson et al., 2023), and scaling behavior with increased training data. Overall, xLSTM performs favorably compared to state-of-the-art Transformers and State Space Models, both in performance and in scaling.
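
The matrix memory with a covariance update rule can be pictured as a key-value store that is written with outer products and read with a query. The sketch below is a simplified illustration in plain Python, not the paper's implementation: the function names `mlstm_write` and `mlstm_read` are hypothetical, the gates are passed in as plain numbers, and the output gate, the learned projections, and any key scaling are left out. It assumes a write of the form C ← f·C + i·v kᵀ with a normalizer n ← f·n + i·k, and a read of the form C q divided by max(|n · q|, 1).

```python
def mlstm_write(C, n, k, v, i_gate=1.0, f_gate=1.0):
    """Covariance-style update: C <- f*C + i * v k^T and n <- f*n + i*k."""
    C_new = [[f_gate * C[r][j] + i_gate * v[r] * k[j] for j in range(len(k))]
             for r in range(len(v))]
    n_new = [f_gate * n[j] + i_gate * k[j] for j in range(len(k))]
    return C_new, n_new

def mlstm_read(C, n, q):
    """Retrieve C q, normalized by max(|n . q|, 1)."""
    denom = max(abs(sum(nj * qj for nj, qj in zip(n, q))), 1.0)
    return [sum(row[j] * q[j] for j in range(len(q))) / denom for row in C]

# Hypothetical usage: store two key-value pairs under orthogonal keys.
d = 2
C = [[0.0] * d for _ in range(d)]
n = [0.0] * d
C, n = mlstm_write(C, n, k=[1.0, 0.0], v=[5.0, 7.0])
C, n = mlstm_write(C, n, k=[0.0, 1.0], v=[-1.0, 2.0])
print(mlstm_read(C, n, [1.0, 0.0]))  # [5.0, 7.0]
print(mlstm_read(C, n, [0.0, 1.0]))  # [-1.0, 2.0]
```

Because each write depends on the previous state only through a scalar multiple of it, the sequence of updates contains no nonlinear recurrence, which is consistent with the summary's description of mLSTM as fully parallelizable.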
Created on 20 Jun. 2024

