GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

AI-generated keywords: GLaM, Language Model, Data Filtering, Ethical Considerations, Performance

AI-generated Key Points

  • GLaM (Generalist Language Model) is a family of language models that uses a sparsely activated mixture-of-experts architecture.
  • GLaM models have significantly reduced training costs compared to dense variants.
  • The largest GLaM model has 1.2 trillion parameters, approximately 7 times as many as GPT-3, yet it consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference.
  • A high-quality dataset of 1.6 trillion tokens representative of various natural language use cases is constructed to train the GLaM models.
  • The dataset consists mainly of web pages, ranging from professional writing to low-quality comment and forum pages.
  • A text quality classifier is developed to filter out low-quality webpages and create a high-quality web corpus.
  • Ethical considerations related to large language models are addressed in the study, including the risk of over-filtering text associated with marginalized groups and of reinforcing unfair bias when automatically filtering low-quality content from web text collections.
  • Existing charters from organizations like OpenAI, Google, Facebook, and Microsoft are highlighted as addressing these ethical concerns.
  • The proposed GLaM models demonstrate efficient scaling while taking into account ethical concerns about training data quality and task applications.

Authors: Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui

License: CC BY 4.0

Abstract: Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
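
To make "sparsely activated" concrete, the sketch below routes a single token through a toy mixture-of-experts layer in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the choice of top-2 routing, the renormalization of the two gate weights, the number of experts, the vector size, and every name (`top_k_route`, `moe_layer`, `make_expert`) are hypothetical. The point it shows is that only the selected experts run for a given token, so compute per token stays small even when the total parameter count is large.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: the weights of the chosen experts are renormalized to sum to 1.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, gate_weights, experts, k=2):
    """Apply only the k selected experts to one token vector."""
    # One gate logit per expert: dot product of the expert's gate row and the token.
    gate_logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]


def make_expert(scale):
    """Hypothetical stand-in for an expert feed-forward network."""
    return lambda v: [scale * x for x in v]


random.seed(0)
NUM_EXPERTS, DIM = 8, 4  # toy sizes chosen for illustration only
experts = [make_expert(s + 1) for s in range(NUM_EXPERTS)]
gate_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token = [0.5, -1.0, 0.25, 2.0]

output, used = moe_layer(token, gate_weights, experts)
print("experts used:", used)
print("output:", output)
```

In this toy setup 2 of 8 experts run per token, so roughly a quarter of the expert parameters are touched for each input, which is the mechanism behind the lower inference cost described in the abstract.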

Submitted to arXiv on 13 Dec. 2021

Results of the summarizing process for the arXiv paper: 2112.06905v1

In this paper, the authors propose and develop a family of language models called GLaM (Generalist Language Model) that uses a sparsely activated mixture-of-experts architecture. This allows the model capacity to scale while significantly reducing training costs compared to dense variants. The largest GLaM model has 1.2 trillion parameters, approximately 7 times as many as GPT-3, but consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference.

To train the GLaM models, a high-quality dataset of 1.6 trillion tokens representative of various natural language use cases is constructed. The dataset consists mainly of web pages, ranging from professional writing to low-quality comment and forum pages. A text quality classifier is developed to filter out low-quality webpages and create a high-quality web corpus. This classifier is trained to distinguish between curated text (such as Wikipedia, books, and selected websites) and other webpages. The filtered subset of webpages is combined with books, Wikipedia pages, public domain social media conversations, and other data sources to create the final GLaM dataset. The study also analyzes how much data filtering contributes to model quality.

Ethical considerations related to large language models are also addressed in this work. The authors emphasize the significance of high-quality pre-training corpora for achieving good model performance. They acknowledge that automatically filtering low-quality content from web text collections can over-filter text associated with marginalized groups and reinforce unfair bias. The authors advocate for a more thoughtful approach when deciding which tasks should be pursued using language models and which tasks should be avoided due to ethical implications. They highlight existing charters from organizations like OpenAI, Google, Facebook, and Microsoft that address these concerns.

Overall, the proposed GLaM models demonstrate efficient scaling while taking into account ethical concerns about training data quality and task applications.
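
The quality-filtering step described in the summary can be sketched as a small binary classifier that scores each webpage by how closely it resembles curated text. This is a minimal illustration only: the summary does not specify the model or the selection rule, so the hashed bag-of-words features, the logistic regression trained with SGD, the fixed 0.5 threshold, the toy training sentences, and all names (`features`, `score`, `train`, `filter_pages`, `NUM_BUCKETS`) are assumptions made for this example.

```python
import math
import zlib

NUM_BUCKETS = 1024  # hypothetical size of the hashed feature space


def features(text):
    """Map each lowercase word to a bucket index (hashed bag of words)."""
    return [zlib.crc32(w.encode("utf-8")) % NUM_BUCKETS for w in text.lower().split()]


def score(weights, bias, text):
    """Estimated probability that the text resembles the curated class."""
    z = bias + sum(weights[i] for i in features(text))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=50, lr=0.1):
    """Fit logistic regression with plain SGD; label 1 = curated, 0 = other."""
    weights, bias = [0.0] * NUM_BUCKETS, 0.0
    for _ in range(epochs):
        for text, label in examples:
            err = score(weights, bias, text) - label
            bias -= lr * err
            for i in features(text):
                weights[i] -= lr * err
    return weights, bias


def filter_pages(pages, weights, bias, threshold=0.5):
    """Keep pages scoring at or above the (assumed) quality threshold."""
    return [p for p in pages if score(weights, bias, p) >= threshold]


# Toy training data, invented for this sketch.
training = [
    ("the committee published a detailed report on regional water policy", 1),
    ("the enzyme catalyzes the reaction under neutral conditions", 1),
    ("lol click here free prizes win win win", 0),
    ("buy cheap followers now click click", 0),
]
weights, bias = train(training)

pages = [
    "the report describes water policy under review",
    "click here win free followers lol",
]
kept = filter_pages(pages, weights, bias)
print(kept)
```

A hard threshold like this is where the over-filtering concern raised in the summary arises: any writing style that is rare in the curated class scores low regardless of its actual quality, so the choice of curated examples directly shapes whose text survives.
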
Created on 15 Jul. 2023
