Jailbreaking Black Box Large Language Models in Twenty Queries

AI-generated keywords: Large Language Models, Human Values, Adversarial Jailbreaks, Prompt Automatic Iterative Refinement (PAIR), JUDGE Function

AI-generated Key Points

  • Growing interest in ensuring Large Language Models (LLMs) align with human values
  • LLM alignment is vulnerable to adversarial jailbreaks, which coax models into overriding their safety guardrails
  • Introduction of the Prompt Automatic Iterative Refinement (PAIR) algorithm to identify these vulnerabilities
  • PAIR generates semantic jailbreaks using social engineering tactics and black-box access to LLM
  • System prompt for PAIR guides attacker LLM to act as red team against target LLM, emphasizing role-playing and emotional manipulation
  • PAIR iteratively refines candidate jailbreaks based on previous prompts, responses, and scores
  • Experimental evaluations show PAIR's efficacy in jailbreaking open-source and closed-source LLMs with competitive success rates
  • Evaluating jailbreaking attacks is challenging because semantic jailbreaks are creative, free-form text
  • Proposal of a JUDGE function parameterized by an LLM to score the responses elicited by candidate jailbreaking prompts
  • Strategies like PAIR offer insights into enhancing model robustness against adversarial attacks while promoting ethical alignment with human values

Authors: Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong

21 pages, 10 figures
License: CC BY 4.0

Abstract: There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and PaLM-2.

Submitted to arXiv on 12 Oct. 2023


Results of the summarizing process for the arXiv paper: 2310.08419v1

There is growing interest in ensuring that Large Language Models (LLMs) align with human values. However, this alignment is susceptible to adversarial jailbreaks, which coax LLMs into bypassing their safety mechanisms. Identifying and understanding these vulnerabilities is crucial to preventing misuse of LLMs.

To this end, the authors propose Prompt Automatic Iterative Refinement (PAIR), an algorithm inspired by social engineering attacks that generates semantic jailbreaks using only black-box access to the target LLM. An attacker LLM, guided by a system prompt, acts as a red team against the target LLM; the system prompt emphasizes social engineering techniques such as role-playing and emotional manipulation in crafting jailbreaking prompts. The attacker iteratively refines its candidate jailbreak by analyzing previous prompts, the target's responses, and the scores they received to identify areas for improvement.

In experimental evaluations on the "harmful behaviors" subset of the AdvBench benchmark, PAIR jailbroke both open-source LLMs such as Vicuna-13B-v1.5 and closed-source LLMs such as GPT-3.5 and GPT-4. Using a curated, representative subset of harmful-behavior objectives and deterministic generation with fixed temperature settings, PAIR achieved competitive success rates, often with fewer than twenty queries per jailbreak.

Evaluating jailbreaking attacks is challenging because semantic jailbreaks are creative, free-form text, so deciding whether a given prompt and response constitute a jailbreak is itself difficult. To address this, the authors propose a JUDGE function parameterized by an LLM. A system prompt instructs the judge LLM to score each response on a scale from 1 to 10, with 10 indicating a detailed, fully jailbroken response.
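
The refine-query-score loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `refine`, `parse_rating`, and the three stub callables are hypothetical, the `Rating: [[n]]` output convention and the stopping threshold of 10 are assumptions, and the default budget of 20 queries simply mirrors the paper's title. The stubs are deterministic placeholders standing in for real model calls.

```python
import re
from typing import Callable, List, Optional, Tuple

# One history entry: (candidate prompt, target response, judge score).
Attempt = Tuple[str, str, int]


def parse_rating(judge_output: str) -> int:
    """Extract a 1-10 score from judge text.

    Assumes the judge was instructed to answer as 'Rating: [[n]]'
    (an assumed convention, not taken from the summary).
    """
    match = re.search(r"\[\[(\d+)\]\]", judge_output)
    if match is None:
        return 1  # treat unparseable output as the lowest score
    return max(1, min(10, int(match.group(1))))


def refine(
    attacker: Callable[[str, List[Attempt]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str, str], str],
    objective: str,
    max_queries: int = 20,
    threshold: int = 10,
) -> Tuple[Optional[str], List[Attempt]]:
    """Iteratively refine a candidate prompt until the judge score reaches threshold."""
    history: List[Attempt] = []
    for _ in range(max_queries):
        # The attacker sees all previous prompts, responses, and scores.
        prompt = attacker(objective, history)
        # Black-box access: only the target's text output is observed.
        response = target(prompt)
        score = parse_rating(judge(objective, prompt, response))
        history.append((prompt, response, score))
        if score >= threshold:
            return prompt, history
    return None, history


# Hypothetical deterministic stubs used only to exercise the loop.
def stub_attacker(objective: str, history: List[Attempt]) -> str:
    return f"candidate-{len(history)}"


def stub_target(prompt: str) -> str:
    return f"response to {prompt}"


def stub_judge(objective: str, prompt: str, response: str) -> str:
    index = int(prompt.rsplit("-", 1)[1])
    return f"Rating: [[{min(10, 4 + 3 * index)}]]"


found, history = refine(stub_attacker, stub_target, stub_judge, "placeholder objective")
print(found, len(history))  # prints "candidate-2 3"
```

Passing the models in as plain callables keeps the loop independent of any particular API, which matches the black-box setting: the loop never inspects the target beyond its text responses.
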
As language models continue to advance, strategies like PAIR offer insights into making models more robust against adversarial attacks while promoting alignment with human values. By combining system-prompt-guided iterative refinement with automated prompt generation, PAIR shows promising results in efficiently identifying LLM vulnerabilities, which can in turn inform stronger safeguards.
Created on 04 Jun. 2024
