Attack Prompt Generation for Red Teaming and Defending Large Language Models

AI-generated keywords: Red Teaming Attacks, Large Language Models, Attack Prompt Generation, Defense Framework, SAP Datasets

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors address vulnerability of large language models (LLMs) to red teaming attacks
  • Proposed integrated approach combines manual and automatic methods for generating high-quality attack prompts
  • Introduction of attack framework guiding LLMs to emulate human-generated prompts through in-context learning
  • Defense framework proposed to fine-tune victim LLMs iteratively to enhance resilience against attacks
  • Extensive experiments validate effectiveness of attack and defense frameworks on various LLMs
  • Release of attack prompt datasets named SAP in different sizes for evaluation and improvement of safety measures
  • Code and dataset available on GitHub for further exploration and implementation

Authors: Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, Xiangnan He

Accepted to EMNLP 2023 (Findings)

Abstract: Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. To address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. Specifically, considering the impressive capabilities of newly emerged LLMs, we propose an attack framework to instruct LLMs to mimic human-generated prompts through in-context learning. Furthermore, we propose a defense framework that fine-tunes victim LLMs through iterative interactions with the attack framework to enhance their safety against red teaming attacks. Extensive experiments on different LLMs validate the effectiveness of our proposed attack and defense frameworks. Additionally, we release a series of attack prompt datasets named SAP with varying sizes, facilitating the safety evaluation and enhancement of more LLMs. Our code and dataset are available at https://github.com/Aatrox103/SAP.
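The in-context learning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the instruction wording, the sampling of `k` examples, and the `generator` callable (a stand-in for any LLM call) are all assumptions made for illustration, and the seed strings are benign placeholders.

```python
import random
from typing import Callable, List


def build_few_shot_prompt(seed_prompts: List[str], k: int, rng: random.Random) -> str:
    """Assemble an in-context prompt from k sampled human-written seed prompts.

    The wording of the instructions is hypothetical, not taken from the paper.
    """
    examples = rng.sample(seed_prompts, k)
    parts = ["Here are test prompts written by human red teamers:"]
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i}: {example}")
    parts.append("Write one new test prompt in the same style.")
    return "\n".join(parts)


def generate_candidates(
    seed_prompts: List[str],
    generator: Callable[[str], str],  # hypothetical stand-in for an LLM call
    n: int,
    k: int = 2,
    seed: int = 0,
) -> List[str]:
    """Ask the generator for n new prompts, each conditioned on k seed examples."""
    rng = random.Random(seed)
    return [generator(build_few_shot_prompt(seed_prompts, k, rng)) for _ in range(n)]
```

In practice `generator` would wrap a real model client, and the candidates would presumably be filtered for quality before use; that filtering step is not described in the metadata and is left out here.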

Submitted to arXiv on 19 Oct. 2023


Results of the summarizing process for the arXiv paper: 2310.12505v1


In "Attack Prompt Generation for Red Teaming and Defending Large Language Models," Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, and Xiangnan He address the vulnerability of large language models (LLMs) to red teaming attacks, which can induce them to produce harmful content. Earlier work builds attack prompts either manually or automatically, and each route is limited by construction cost or prompt quality. The authors propose an integrated approach that combines the two to generate high-quality attack prompts economically. Their attack framework instructs recent LLMs to mimic human-written prompts through in-context learning. Their defense framework fine-tunes victim LLMs through iterative interactions with the attack framework to improve their safety against red teaming attacks. Experiments on several LLMs validate both frameworks. The authors also release a series of attack prompt datasets named SAP, in varying sizes, to support safety evaluation and improvement of other LLMs. The code and dataset are available on GitHub at https://github.com/Aatrox103/SAP. The paper was accepted to EMNLP 2023 (Findings).
Created on 21 Feb. 2024
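The iterative defense described in the summary can be sketched as a loop in which each round attacks the current victim, collects the prompts that still elicit unsafe output, and fine-tunes on them. This is only a sketch under stated assumptions: `attack`, `is_harmful`, `fine_tune`, the fixed `safe_response` target, and the early-stopping rule are hypothetical placeholders, since the metadata does not specify how the authors build fine-tuning data or decide when to stop.

```python
from typing import Callable, List, Tuple, TypeVar

M = TypeVar("M")  # any callable model object mapping a prompt to a response


def iterative_defense(
    victim: M,
    attack: Callable[[], List[str]],                     # hypothetical: yields attack prompts
    is_harmful: Callable[[str], bool],                   # hypothetical: judges a response
    fine_tune: Callable[[M, List[Tuple[str, str]]], M],  # hypothetical: returns updated model
    rounds: int,
    safe_response: str = "I can't help with that.",      # assumed fixed training target
) -> Tuple[M, List[int]]:
    """Run up to `rounds` attack/fine-tune cycles; return the model and failures per round."""
    history: List[int] = []
    for _ in range(rounds):
        prompts = attack()
        # Keep only the prompts that still produce a harmful response.
        failures = [p for p in prompts if is_harmful(victim(p))]
        history.append(len(failures))
        if not failures:
            break  # assumed stopping rule: no successful attacks this round
        victim = fine_tune(victim, [(p, safe_response) for p in failures])
    return victim, history


class ToyVictim:
    """Toy model that refuses only the prompts it has been 'fine-tuned' on."""

    def __init__(self, refused=frozenset()):
        self.refused = frozenset(refused)

    def __call__(self, prompt: str) -> str:
        return "REFUSED" if prompt in self.refused else "UNSAFE"


def toy_fine_tune(victim: ToyVictim, pairs: List[Tuple[str, str]]) -> ToyVictim:
    """Stand-in for fine-tuning: memorize the prompts to refuse."""
    return ToyVictim(victim.refused | {p for p, _ in pairs})
```

With the toy pieces, `iterative_defense(ToyVictim(), lambda: ["a", "b"], lambda r: r == "UNSAFE", toy_fine_tune, rounds=3)` records two failures in the first round and none in the second, then stops. A real setup would replace every stub with an actual model, judge, and training routine.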

