Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

AI-generated keywords: Language models RNNs Hawk Griffin efficiency

AI-generated Key Points

  • Authors introduce Hawk and Griffin, two novel recurrent neural network (RNN) models
  • Hawk features gated linear recurrences, while Griffin combines gated linear recurrences with local attention mechanisms
  • Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens, and can extrapolate to sequences significantly longer than those seen during training
  • Models exhibit hardware efficiency comparable to Transformers during training and offer lower latency and higher throughput during inference
  • Griffin is scaled up to 14 billion parameters, and the authors explain how to shard the models for efficient distributed training
  • In an evaluation of 1-billion-parameter models, Griffin shows better latency and throughput than an MQA Transformer

Authors: Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre

25 pages, 11 figures
License: CC BY 4.0

Abstract: Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
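
The abstract names two building blocks, a gated linear recurrence and local attention, without defining either. The sketch below is a generic illustration and not the paper's actual layer. It assumes a simplified per-channel gate (the hypothetical parameters `w_gate` and `b_gate`) and a plain sliding-window causal mask (the hypothetical parameter `window`). It only shows what "linear in the hidden state" and "local" mean here.

```python
import math


def sigmoid(z):
    """Logistic function, used to keep each gate strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(xs, w_gate, b_gate):
    """Run a simplified, per-channel gated linear recurrence.

    xs      : list of T input vectors, each a list of D floats
    w_gate  : list of D gate weights (hypothetical, one per channel)
    b_gate  : list of D gate biases (hypothetical, one per channel)

    The update h_t = a_t * h_{t-1} + (1 - a_t) * x_t is linear in h_{t-1};
    only the gate a_t depends nonlinearly on the input x_t.
    """
    dim = len(w_gate)
    h = [0.0] * dim
    states = []
    for x in xs:
        # Input-dependent gate, one value per channel.
        a = [sigmoid(w_gate[d] * x[d] + b_gate[d]) for d in range(dim)]
        # Blend the previous state with the new input, channel by channel.
        h = [a[d] * h[d] + (1.0 - a[d]) * x[d] for d in range(dim)]
        states.append(h)
    return states


def local_causal_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j.

    A position sees itself and at most window - 1 earlier positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With zero gate weights and biases every gate equals 0.5, so `gated_linear_recurrence([[1.0], [1.0]], [0.0], [0.0])` moves the state halfway toward the input at each step. `local_causal_mask(4, 2)` lets each position attend only to itself and its immediate predecessor.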

Submitted to arXiv on 29 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.19427v1

In their paper titled "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models," authors Soham De, Samuel L. Smith, Anushan Fernando, and their team introduce Hawk and Griffin, two novel recurrent neural network (RNN) models designed to address the challenges of training and scaling RNNs.

Hawk features gated linear recurrences, while Griffin combines gated linear recurrences with local attention mechanisms. The authors demonstrate that Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. Notably, Griffin can extrapolate on sequences significantly longer than those encountered during training. The models match the hardware efficiency of Transformers during training and offer lower latency and significantly higher throughput during inference.

Furthermore, the authors scale Griffin up to 14 billion parameters and explain how to shard the models for efficient distributed training. In their evaluation of models with 1 billion parameters, they compare the latency and throughput of an MQA Transformer against Hawk and Griffin. The results highlight the superior performance of Griffin in terms of both latency and throughput.

Overall, the study presents Hawk and Griffin as promising architectures for language modeling at scale: they match or exceed the reported results of comparable models while maintaining hardware efficiency and scalability.
Created on 10 Apr. 2024
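
The summary above reports lower latency and higher throughput at inference but does not say where the gain comes from. A common explanation, offered here as an assumption rather than as the paper's own analysis, is that a global-attention cache grows with sequence length, while a local-attention cache and a recurrent state stay bounded. The sketch below counts cached elements per layer. All sizes and parameter names (`num_kv_heads`, `head_dim`, `window`, `state_dim`) are hypothetical and not taken from the paper.

```python
def global_attention_cache_size(seq_len, num_kv_heads, head_dim):
    """Elements cached per layer: one key and one value vector per past token."""
    return 2 * seq_len * num_kv_heads * head_dim


def local_attention_cache_size(seq_len, window, num_kv_heads, head_dim):
    """Same as above, but only the most recent `window` tokens are kept."""
    return 2 * min(seq_len, window) * num_kv_heads * head_dim


def recurrent_state_size(state_dim):
    """A recurrent layer keeps one fixed-size state, whatever the sequence length."""
    return state_dim


if __name__ == "__main__":
    # Hypothetical sizes; multi-query attention (MQA) uses a single KV head.
    for seq_len in (1024, 4096, 16384):
        print(
            seq_len,
            global_attention_cache_size(seq_len, num_kv_heads=1, head_dim=128),
            local_attention_cache_size(seq_len, window=1024, num_kv_heads=1, head_dim=128),
            recurrent_state_size(state_dim=1024),
        )
```

Under these assumed sizes only the first count grows as the sequence gets longer; the other two stop changing once the sequence exceeds the window.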
