Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic

AI-generated keywords: Language Models, Fine-tuning, Safety, RESTA, Multilingual Benchmarks

AI-generated Key Points

  • Authors address the challenge of compromised safety in fine-tuned language models
  • Introduce RESTA method to restore safety through task arithmetic by adding a safety vector to model weights
  • Effectiveness of RESTA demonstrated in parameter-efficient and full fine-tuning across downstream tasks: instruction following in Chinese, English, and Hindi, and problem-solving in Code and Math
  • Generalizability of RESTA shown on three existing safety evaluation benchmarks and a new multilingual benchmark dataset of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm
  • RESTA reduces harmfulness of compromised models from 18.6% to 5.1% (parameter-efficient fine-tuning) and from 9.2% to 1.5% (full fine-tuning) while maintaining most of the task performance
  • Source code for RESTA available at https://github.com/declare-lab/resta
  • Promising results for RESTA's impact on Chinese and Vietnamese languages
  • Performance comparisons among base model Llama-2, SFT with DARE dropout, RESTA (added safety vector), and their combination (RESTAd) show improved task-specific scores
  • Analysis on CATQA dataset reveals significant increases in unsafety scores for Llama-2 safe model under different evaluations
  • Research provides insights into enhancing language model safety during fine-tuning processes using arithmetic methods like RESTA

Authors: Rishabh Bhardwaj, Do Duc Anh, Soujanya Poria

License: CC BY-SA 4.0

Abstract: Aligned language models face a significant limitation as their fine-tuning often results in compromised safety. To tackle this, we propose a simple method RESTA that performs LLM safety realignment. RESTA stands for REstoring Safety through Task Arithmetic. At its core, it involves a simple arithmetic addition of a safety vector to the weights of the compromised model. We demonstrate the effectiveness of RESTA in both parameter-efficient and full fine-tuning, covering a wide range of downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math. We also showcase the generalizability of RESTA on three existing safety evaluation benchmarks and a multilingual benchmark dataset proposed as a part of this work, consisting of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. Overall, RESTA decreases the harmfulness of the compromised model from 18.6% to 5.1% and from 9.2% to 1.5% in parameter-efficient and full fine-tuning, respectively, while maintaining most of the model's performance on the task. We release the source codes at: https://github.com/declare-lab/resta.
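
The "simple arithmetic addition of a safety vector" described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the safety vector is obtained; the sketch assumes, following the usual task-arithmetic recipe, that it is the element-wise difference between a safety-aligned model and an unaligned counterpart. The helper names (`task_vector`, `add_vector`), the `scale` parameter, and all weight values are hypothetical and are not taken from the RESTA repository.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def task_vector(target: Weights, reference: Weights) -> Weights:
    """Element-wise difference target - reference for every parameter."""
    return {
        name: [t - r for t, r in zip(target[name], reference[name])]
        for name in target
    }


def add_vector(weights: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Return weights + scale * vector, leaving the inputs unmodified."""
    return {
        name: [w + scale * v for w, v in zip(weights[name], vector[name])]
        for name in weights
    }


# Toy stand-ins for real checkpoints (all values are made up).
aligned = {"layer.w": [1.0, 2.0, 3.0]}      # safety-aligned model
unaligned = {"layer.w": [0.5, 2.0, 2.0]}    # assumed unaligned counterpart
compromised = {"layer.w": [0.7, 2.5, 2.1]}  # aligned model after task fine-tuning

# Assumption: safety vector = aligned weights - unaligned weights.
safety_vector = task_vector(aligned, unaligned)

# Safety re-alignment: add the safety vector to the compromised weights.
restored = add_vector(compromised, safety_vector, scale=1.0)
print(restored)
```

With real models the same arithmetic would be applied tensor by tensor over the checkpoints; `scale` is an illustrative knob for how strongly the safety vector is applied.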

Submitted to arXiv on 19 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.11746v1

In their paper "Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic," Rishabh Bhardwaj, Do Duc Anh, and Soujanya Poria address a limitation of aligned language models: fine-tuning often compromises their safety. They introduce RESTA (REstoring Safety through Task Arithmetic), which realigns a compromised model by adding a safety vector to its weights.

The effectiveness of RESTA is demonstrated in both parameter-efficient and full fine-tuning across downstream tasks: instruction following in Chinese, English, and Hindi, and problem-solving in Code and Math. The study also shows that RESTA generalizes to three existing safety evaluation benchmarks and introduces a multilingual benchmark dataset of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. Overall, RESTA reduces the harmfulness of compromised models from 18.6% to 5.1% in parameter-efficient fine-tuning and from 9.2% to 1.5% in full fine-tuning, while maintaining most of the model's task performance. The source code is available on GitHub.

Further evaluations on Chinese and Vietnamese show promising results. Performance comparisons among the base model Llama-2, SFT with DARE dropout, RESTA (added safety vector), and their combination (RESTAd) reveal improved task-specific performance scores. Additionally, analysis on versions of the safety evaluation dataset CATQA highlights significant increases in unsafety scores for the Llama-2 safe model when subjected to different evaluations. This research provides insight into preserving the safety of language models during fine-tuning and demonstrates that arithmetic methods like RESTA can realign model safety across diverse tasks and languages.
Created on 27 Feb. 2024
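
The summary mentions DARE dropout and its combination with RESTA (RESTAd) without describing the mechanics. The sketch below is a hedged illustration of DARE as generally described (Drop And REscale): each fine-tuning weight change is zeroed with probability `drop_rate`, and the surviving changes are rescaled by 1 / (1 - drop_rate). Adding the safety vector afterwards is one plausible reading of "their combination", not a confirmed description of the authors' pipeline. The function name, the seed handling, and all values are hypothetical.

```python
import random
from typing import List


def drop_and_rescale(delta: List[float], drop_rate: float, seed: int = 0) -> List[float]:
    """Zero each entry with probability drop_rate; rescale survivors by 1 / (1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    keep_scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * keep_scale for d in delta]


# Toy values, all made up for illustration.
base = [1.0, 2.0, 3.0, 4.0]             # weights before fine-tuning
fine_tuned = [1.2, 1.9, 3.4, 4.1]       # weights after fine-tuning (SFT)
safety_vector = [0.5, 0.0, 1.0, -0.2]   # hypothetical safety vector

# Fine-tuning changes (delta parameters) relative to the base model.
delta = [f - b for f, b in zip(fine_tuned, base)]

# DARE step: sparsify the fine-tuning changes.
sparse_delta = drop_and_rescale(delta, drop_rate=0.5)

# Assumed combination: base + sparsified changes + safety vector.
restored = [b + d + s for b, d, s in zip(base, sparse_delta, safety_vector)]
print(restored)
```

The rescaling keeps the expected value of each weight change equal to its original value, which is why dropping a large share of the changes can leave task behaviour largely intact.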
