Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

AI-generated keywords: Language Models

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors show that Language Models (LMs) can gain new capabilities by assimilating the parameters of homologous models, without retraining or GPUs
  • Study covers both encoder- and decoder-based LMs and treats abilities imparted by Supervised Fine-Tuning (SFT) as delta parameters, the difference between fine-tuned and pre-trained parameters
  • Introduction of a novel operation called DARE (Drop And REscale), which sets most delta parameters to zero without affecting the capabilities of SFT LMs
  • DARE-sparsified delta parameters of multiple task-specific LMs can be merged by parameter averaging into one model with diverse abilities
  • Experiments use eight GLUE datasets with BERT and RoBERTa, plus merges of WizardLM, WizardMath, and Code Alpaca based on Llama 2; merging WizardLM with WizardMath raises WizardLM's GSM8K zero-shot accuracy from 2.2 to 66.3

Authors: Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li

24 pages, 21 figures

Abstract: In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We initially observe that by introducing a novel operation called DARE (Drop And REscale), most delta parameters can be directly set to zeros without affecting the capabilities of SFT LMs and larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify delta parameters of multiple SFT homologous models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental results show that: (1) The delta parameter value ranges for SFT models are typically small, often within 0.005, and DARE can eliminate 99% of them effortlessly. However, once the models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We have also tried to remove fine-tuned instead of delta parameters and find that a 10% reduction can lead to drastically decreased performance (even to 0). This highlights that SFT merely stimulates the abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, the merger of WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original 64.2 performance. Codes are available at https://github.com/yule-BUAA/MergeLM.
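The drop-and-rescale operation named in the abstract can be illustrated with a minimal sketch. It assumes each delta parameter is dropped independently with probability `drop_rate` and that the surviving deltas are multiplied by 1 / (1 - drop_rate), so the expected delta stays the same. The function name `dare`, its arguments, and the flat Python lists standing in for weight tensors are illustrative assumptions, not the authors' MergeLM code.

```python
import random


def dare(pretrained, finetuned, drop_rate, seed=0):
    """Drop And REscale on flat lists of floats (hypothetical helper).

    Assumption: deltas are dropped independently with probability
    `drop_rate`, and kept deltas are rescaled by 1 / (1 - drop_rate).
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if len(pretrained) != len(finetuned):
        raise ValueError("parameter lists must have the same length")

    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_rate)
    result = []
    for w_pre, w_sft in zip(pretrained, finetuned):
        delta = w_sft - w_pre          # delta parameter for this weight
        if rng.random() < drop_rate:
            delta = 0.0                # drop: fall back to the pre-trained value
        else:
            delta *= scale             # rescale the surviving delta
        result.append(w_pre + delta)
    return result


# Toy usage: 10,000 weights whose deltas are all 0.004.
pre = [0.0] * 10_000
sft = [0.004] * 10_000
sparse = dare(pre, sft, drop_rate=0.9)
dropped_fraction = sum(1 for v in sparse if v == 0.0) / len(sparse)
mean_delta = sum(sparse) / len(sparse)
print(dropped_fraction, mean_delta)
```

In this toy run most weights revert to their pre-trained value, while the mean delta stays close to the original 0.004 because of the rescaling.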

Submitted to arXiv on 06 Nov. 2023

Results of the summarizing process for the arXiv paper: 2311.03099v1

In their paper titled "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch," authors Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li explore how Language Models (LMs) can gain new capabilities by assimilating the parameters of homologous models, without retraining or GPUs. The study covers both encoder- and decoder-based LMs and notes that abilities imparted by Supervised Fine-Tuning (SFT) are reflected in the delta parameters, that is, the difference between fine-tuned and pre-trained parameters. The researchers introduce an operation called DARE (Drop And REscale) that sets most delta parameters to zero without affecting the capabilities of SFT LMs. They also show that DARE-sparsified delta parameters from multiple task-specific LMs can be merged into one model with diverse abilities by parameter averaging. Experiments cover eight datasets from the GLUE benchmark with BERT and RoBERTa, as well as merges of WizardLM, WizardMath, and Code Alpaca based on Llama 2.
Created on 25 Feb. 2024
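
The merging step described in the summary can be sketched as follows, under stated assumptions: every fine-tuned model shares the same pre-trained weights, each model's delta parameters are sparsified by random drop and rescale, and the merged model is the pre-trained weights plus the average of those sparsified deltas. The names `sparsify_delta` and `merge_with_dare` are hypothetical, and flat lists stand in for real weight tensors.

```python
import random


def sparsify_delta(pretrained, finetuned, drop_rate, rng):
    """Return drop-and-rescaled delta parameters (hypothetical helper)."""
    scale = 1.0 / (1.0 - drop_rate)
    return [
        0.0 if rng.random() < drop_rate else (w_sft - w_pre) * scale
        for w_pre, w_sft in zip(pretrained, finetuned)
    ]


def merge_with_dare(pretrained, finetuned_models, drop_rate=0.9, seed=0):
    """Average the sparsified deltas of several models onto one backbone.

    Assumption: all models in `finetuned_models` were fine-tuned from
    `pretrained` and have the same number of parameters.
    """
    if not finetuned_models:
        raise ValueError("need at least one fine-tuned model")
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if any(len(m) != len(pretrained) for m in finetuned_models):
        raise ValueError("all models must match the pre-trained size")

    rng = random.Random(seed)
    deltas = [
        sparsify_delta(pretrained, model, drop_rate, rng)
        for model in finetuned_models
    ]
    count = len(deltas)
    # Averaging the models equals adding the mean delta to the backbone.
    return [
        w_pre + sum(delta[i] for delta in deltas) / count
        for i, w_pre in enumerate(pretrained)
    ]


# Toy usage with two "task-specific" models on a two-weight backbone.
backbone = [1.0, 2.0]
model_a = [1.5, 2.0]   # changed only the first weight
model_b = [1.0, 3.0]   # changed only the second weight
print(merge_with_dare(backbone, [model_a, model_b], drop_rate=0.0))
```

With `drop_rate=0.0` the sketch reduces to plain parameter averaging; a higher rate zeroes most deltas before averaging, so fewer entries from different models overlap.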
