Scaling Instruction-Finetuned Language Models

AI-generated keywords: Instruction finetuning, Language models, Model performance, Generalization, Pretrained language models

AI-generated Key Points

  • Instruction finetuning improves the performance of language models and their generalization to unseen tasks
  • The paper focuses on three aspects: scaling the number of tasks, scaling the model size, and finetuning on chain-of-thought (CoT) data
  • Results improve across model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation)
  • Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by a large margin (+9.4% on average)
  • Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU
  • The publicly released Flan-T5 checkpoints achieve strong few-shot performance even compared to much larger models such as PaLM 62B
  • Instruction finetuning is a general method for improving the performance and usability of pretrained language models, at a small computational cost relative to pretraining

Authors: Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei

Public checkpoints: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints
License: CC BY 4.0

Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
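
To make "datasets phrased as instructions" and the three prompting setups (zero-shot, few-shot, CoT) more concrete, here is a minimal Python sketch. The function name `format_instruction`, the `Q:`/`A:` layout, and the "Let's think step by step." trigger are illustrative assumptions, not the templates used in the paper.

```python
from typing import List, Optional, Tuple


def format_instruction(
    instruction: str,
    question: str,
    exemplars: Optional[List[Tuple[str, str]]] = None,
    chain_of_thought: bool = False,
) -> str:
    """Phrase one example as an instruction prompt (hypothetical template).

    - No exemplars: zero-shot prompt.
    - With exemplars: few-shot prompt (solved examples come first).
    - chain_of_thought=True: append a reasoning trigger to the final answer slot.
    """
    parts = [instruction]
    # Few-shot: each solved exemplar is shown before the actual question.
    for ex_question, ex_answer in exemplars or []:
        parts.append(f"Q: {ex_question}\nA: {ex_answer}")
    # Assumed CoT trigger phrase; the paper's own wording may differ.
    suffix = " Let's think step by step." if chain_of_thought else ""
    parts.append(f"Q: {question}\nA:{suffix}")
    return "\n\n".join(parts)


instruction = "Answer the following yes/no question."

# Zero-shot: instruction plus the question only.
zero_shot = format_instruction(instruction, "Is the sky green?")

# Few-shot with chain-of-thought: one exemplar, then the question with a trigger.
few_shot_cot = format_instruction(
    instruction,
    "Is the sky green?",
    exemplars=[("Is water wet?", "yes")],
    chain_of_thought=True,
)

print(zero_shot)
print("---")
print(few_shot_cot)
```

In an instruction-finetuning setup, strings like these would serve as model inputs, paired with the reference answers as targets, across many tasks.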

Submitted to arXiv on 20 Oct. 2022

Results of the summarizing process for the arXiv paper: 2210.11416v1

The authors study instruction finetuning of language models and its effect on model performance and generalization to unseen tasks. They focus on three aspects: scaling the number of tasks, scaling the model size, and finetuning on chain-of-thought (CoT) data. Instruction finetuning along these three axes substantially improves performance across model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For example, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by a large margin (+9.4% on average) and achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. The authors also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models such as PaLM 62B. The study concludes that instruction finetuning is a general method for improving the performance and usability of pretrained language models, at a small computational cost relative to pretraining.
Created on 17 Feb. 2024
