Universal and Transferable Adversarial Attacks on Aligned Language Models

AI-generated keywords: Adversarial Attacks

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors propose a simple yet effective attack method that causes aligned large language models (LLMs) to generate objectionable behaviors
  • Approach finds a suffix that, when appended to a query, maximizes the probability of an affirmative response rather than a refusal
  • Adversarial prompts generated by their approach are highly transferable, even to black-box, publicly released LLMs
  • Attack suffix trained on multiple prompts and models (Vicuna-7B and 13B) induces objectionable content in the public interfaces to ChatGPT, Bard, and Claude, and in open-source LLMs such as LLaMA-2-Chat, Pythia, and Falcon
  • This work advances the state-of-the-art in adversarial attacks against aligned language models
  • Raises important questions about preventing LLMs from generating objectionable information
  • Code for the approach is available at github.com/llm-attacks/llm-attacks

Authors: Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson

Abstract: Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
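
The objective described in the abstract can be written compactly. The following is only a sketch under stated assumptions: the abstract says the suffix is chosen to maximize the probability of an affirmative response across multiple prompts and multiple models, but it gives no formula. The summed negative log-likelihood form and all symbols (s for the suffix, x^{(i)} for the i-th query, y^{(i)} for its affirmative target response, p_{\theta_j} for the j-th model, \mathcal{V} for the vocabulary, k for the suffix length) are notation introduced here for illustration.

```latex
% Assumed notation, not taken from the abstract:
%   s          adversarial suffix of k tokens
%   x^{(i)}    i-th query, i = 1, ..., m
%   y^{(i)}    affirmative target response for query i
%   p_{\theta_j}  j-th model used for training, j = 1, ..., M
%   \oplus     token concatenation
\[
  s^{\star}
  \;=\;
  \arg\min_{s \in \mathcal{V}^{k}}
  \;\sum_{j=1}^{M} \sum_{i=1}^{m}
  -\log p_{\theta_j}\!\left( y^{(i)} \,\middle|\, x^{(i)} \oplus s \right)
\]
```

Minimizing this sum is equivalent to maximizing the joint probability of the affirmative responses. Because s ranges over discrete tokens, the minimization cannot be done by plain gradient descent, which is consistent with the abstract's description of a combined greedy and gradient-based search.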

Submitted to arXiv on 27 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.15043v1

In their paper titled "Universal and Transferable Adversarial Attacks on Aligned Language Models," authors Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson address the problem of objectionable content generated by large language models (LLMs). Recent work has aligned these models to prevent undesirable outputs, and earlier attacks that circumvent this alignment ("jailbreaks") have required significant human ingenuity and are brittle in practice. The authors propose a simple yet effective attack method that causes aligned language models to generate objectionable behaviors. Their approach finds a suffix that, when attached to a wide range of queries, aims to maximize the probability that the model produces an affirmative response rather than refusing to answer. Instead of relying on manual engineering, the method generates these adversarial suffixes automatically through a combination of greedy and gradient-based search techniques, and it improves on past automatic prompt generation methods.

Surprisingly, the authors find that the adversarial prompts generated by their approach are quite transferable, including to black-box, publicly released LLMs. They train an adversarial attack suffix on multiple prompts asking for different types of objectionable content and on multiple models (Vicuna-7B and 13B). The resulting suffix induces objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as in open-source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. The work significantly advances the state of the art in adversarial attacks against aligned language models and raises important questions about how such systems can be prevented from producing objectionable information. The authors provide code for their approach at github.com/llm-attacks/llm-attacks.
Created on 10 Aug. 2023

