Flan-MoE: Scaling Instruction-Finetuned Language Models with Sparse Mixture of Experts

AI-generated keywords: Flan-MoE; Instruction-Finetuned Sparse Mixture-of-Expert (MoE) models; Efficient and Scalable Methods; Task-Agnostic Learning; Performance Optimization

AI-generated Key Points

  • Flan-MoE is introduced to address the demand for efficient and scalable language models
  • Instruction-finetuning is crucial for enhancing MoE model performance
  • Flan-MoE outperforms dense models in various experiment settings
  • Flan-MoE-32B surpasses Flan-PaLM-62B across four benchmarks with one-third of the FLOPs
  • Training data includes 1,836 finetuning tasks from a combination of four mixtures
  • Evaluations are conducted through zero-shot and few-shot assessments on held-out tasks
  • Evaluation benchmarks include MMLU, BBH, GSM8K, SVAMP/ASDIV, and StrategyQA
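To make the zero-shot and few-shot settings concrete, here is a minimal sketch of how such prompts might be assembled. The `build_prompt` helper, the `Q:`/`A:` template, and the arithmetic exemplar are hypothetical illustrations, not the paper's actual evaluation templates.

```python
def build_prompt(question, exemplars=(), chain_of_thought=False):
    """Assemble a prompt; no exemplars means zero-shot, one or more means few-shot.

    The "Q:/A:" template is an assumption made for illustration only.
    """
    parts = []
    for ex in exemplars:
        answer = ex["answer"]
        if chain_of_thought:
            # Chain-of-thought exemplars show the reasoning before the answer.
            answer = f"{ex['rationale']} So the answer is {ex['answer']}."
        parts.append(f"Q: {ex['question']}\nA: {answer}")
    # The held-out question always comes last, with the answer left open.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical exemplar, invented for this sketch.
exemplar = {
    "question": "What is 2 + 3?",
    "rationale": "2 plus 3 equals 5.",
    "answer": "5",
}

zero_shot = build_prompt("What is 7 + 5?")
few_shot_direct = build_prompt("What is 7 + 5?", [exemplar])
few_shot_cot = build_prompt("What is 7 + 5?", [exemplar], chain_of_thought=True)
```

The only difference between the settings is what precedes the final question: nothing (zero-shot), solved examples (few-shot), or solved examples with written-out reasoning (chain-of-thought).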

Authors: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou

Preprint
License: CC BY 4.0

Abstract: The explosive growth of language models and their applications has led to an increased demand for efficient and scalable methods. In this paper, we introduce Flan-MoE, a set of Instruction-Finetuned Sparse Mixture-of-Expert (MoE) models. We show that naively finetuning MoE models on a task-specific dataset (in other words, no instruction-finetuning) often yields worse performance compared to dense models of the same computational complexity. However, our Flan-MoE outperforms dense models under multiple experiment settings: instruction-finetuning only and instruction-finetuning followed by task-specific finetuning. This shows that instruction-finetuning is an essential stage for MoE models. Specifically, our largest model, Flan-MoE-32B, surpasses the performance of Flan-PaLM-62B on four benchmarks, while utilizing only one-third of the FLOPs. The success of Flan-MoE encourages rethinking the design of large-scale, high-performance language models, under the setting of task-agnostic learning.
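The abstract contrasts sparse MoE models with dense models of the same computational cost. As a rough illustration of why a sparse model can hold many parameters while spending fewer FLOPs per token, the sketch below routes a token to only a few experts. The top-2 softmax gating, the expert count, and the toy experts are assumptions chosen for illustration; the abstract does not specify Flan-MoE's routing scheme.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, router_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    output = [0.0] * len(token)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Toy experts (hypothetical): each scales the input by a different constant.
NUM_EXPERTS = 8
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, NUM_EXPERTS + 1)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits, top_k=2)
active_fraction = len(selected) / NUM_EXPERTS  # 2 of 8 experts run per token
layer_out = moe_layer([1.0, 2.0], logits, experts)
```

In this toy setup every token pays for two expert calls regardless of how many experts exist, which is the general sense in which sparse models decouple parameter count from per-token compute.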

Submitted to arXiv on 24 May. 2023


Results of the summarizing process for the arXiv paper: 2305.14705v1

In this paper, the authors introduce Flan-MoE, a set of Instruction-Finetuned Sparse Mixture-of-Expert (MoE) models designed to address the increasing demand for efficient and scalable language models. The study finds that finetuning MoE models directly on task-specific datasets, without instruction-finetuning, often yields worse performance than dense models of the same computational complexity. Flan-MoE, however, outperforms dense models in multiple experiment settings: instruction-finetuning only, and instruction-finetuning followed by task-specific finetuning. The authors conclude that instruction-finetuning is an essential stage for MoE models. The largest model in the study, Flan-MoE-32B, outperforms Flan-PaLM-62B on four benchmarks while using only one-third of the FLOPs. The authors argue that this result encourages rethinking the design of large-scale, high-performance language models in the setting of task-agnostic learning.

The training data for all models includes 1,836 finetuning tasks drawn from four mixtures: Muffin (80 tasks, combining tasks from previous work with 26 dialog and program synthesis tasks), T0-SF (193 tasks), NIV2 (1,554 tasks), and CoT (9 reasoning tasks). Evaluation uses zero-shot and few-shot prompting on held-out tasks that are not part of the finetuning data. The benchmarks include MMLU (exam questions from diverse fields), BBH (challenging tasks from BIG-Bench), the math word problem benchmarks GSM8K and SVAMP/ASDIV, and the open-domain question benchmark StrategyQA. Results are reported under both direct prompting and chain-of-thought prompting.
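The mixture sizes quoted above can be checked with a few lines of arithmetic. This sketch assumes Muffin counts as 80 tasks in total, with the 26 dialog and program synthesis tasks treated as part of that figure, since that is the reading under which the four mixtures add up to the stated 1,836 tasks.

```python
# Task counts per mixture as quoted in the summary.
# Assumption: Muffin contributes 80 tasks in total.
mixture_task_counts = {
    "Muffin": 80,
    "T0-SF": 193,
    "NIV2": 1554,
    "CoT": 9,
}

total_tasks = sum(mixture_task_counts.values())

# Share of each mixture by task count (not by number of training examples).
shares = {name: count / total_tasks for name, count in mixture_task_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total_tasks} tasks")
```

By task count, NIV2 accounts for roughly 85% of the collection, while the CoT mixture contributes fewer than 1% of the tasks.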
Created on 01 Mar. 2024

