Steering Llama 2 via Contrastive Activation Addition

AI-generated keywords: Language model steering, Contrastive Activation Addition, Llama 2 Chat, Large Language Models (LLMs), Behavioral question datasets

AI-generated Key Points

  • **Contrastive Activation Addition (CAA)** is introduced as a method for steering language models by modifying activations during their forward passes
  • CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a behavior, such as factual versus hallucinatory responses
  • Steering vectors are added at all token positions after the user's prompt, with a positive or negative coefficient controlling the degree of the targeted behavior
  • CAA is evaluated on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks
  • CAA significantly alters model behavior, outperforms traditional methods like finetuning and few-shot prompting, and only minimally reduces capabilities
  • Activation space interpretation methods give deeper insight into CAA's mechanisms and into how high-level concepts are represented in Large Language Models (LLMs)
  • CAA both steers model outputs accurately and sheds light on the inner workings of LLMs
  • Contributors: Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong (Anthropic), Evan Hubinger (Anthropic), and Alexander Matt Turner (Center for Human-Compatible AI)

Authors: Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

License: CC BY 4.0

Abstract: We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying activations during their forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using both multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, outperforms traditional methods like finetuning and few-shot prompting, and minimally reduces capabilities. Moreover, by employing various activation space interpretation methods, we gain deeper insights into CAA's mechanisms. CAA both accurately steers model outputs and also sheds light on how high-level concepts are represented in Large Language Models (LLMs).
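
The procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `compute_steering_vector` and `add_steering`, the array shapes, and the choice of one activation vector per example are all hypothetical. A real setup would read and write residual stream activations inside the model at a chosen layer.

```python
import numpy as np


def compute_steering_vector(pos_acts, neg_acts):
    """Average the activation difference over paired examples.

    pos_acts, neg_acts: arrays of shape (n_pairs, d_model), where row i of
    each array comes from the positive and negative example of pair i
    (assumed layout, for illustration only).
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    if pos.shape != neg.shape:
        raise ValueError("positive and negative activations must be paired")
    # Mean over pairs of (positive - negative) gives one d_model-sized vector.
    return (pos - neg).mean(axis=0)


def add_steering(resid, vector, coeff, prompt_len):
    """Add coeff * vector at every token position after the prompt.

    resid: array of shape (seq_len, d_model) holding residual stream
    activations for one sequence. prompt_len is the number of prompt tokens,
    which are left untouched. Returns a new array; the input is not modified.
    """
    out = np.array(resid, dtype=float, copy=True)
    # Broadcasting adds the same vector to each row from prompt_len onward.
    out[prompt_len:] += coeff * np.asarray(vector, dtype=float)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(size=(8, 16))  # toy "positive example" activations
    neg = rng.normal(size=(8, 16))  # toy "negative example" activations
    vec = compute_steering_vector(pos, neg)

    resid = rng.normal(size=(10, 16))  # toy sequence of 10 tokens
    # A negative coefficient steers away from the behavior, a positive one toward it.
    steered = add_steering(resid, vec, coeff=-1.5, prompt_len=6)
    print(steered.shape)
```

In this sketch the sign and size of `coeff` play the role of the positive or negative coefficient mentioned in the abstract, and `prompt_len` marks where the user's prompt ends.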

Submitted to arXiv on 09 Dec. 2023

Results of the summarizing process for the arXiv paper: 2312.06681v1

In their paper "Steering Llama 2 via Contrastive Activation Addition," Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner introduce Contrastive Activation Addition (CAA), a method for steering language models by modifying activations during their forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a behavior, such as factual versus hallucinatory responses. During inference, these vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, which controls the degree of the targeted behavior. The authors evaluate CAA on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. They report that CAA significantly alters model behavior, outperforms traditional methods like finetuning and few-shot prompting, and only minimally reduces capabilities. Using various activation space interpretation methods, they also gain insight into CAA's mechanisms and into how high-level concepts are represented in Large Language Models (LLMs). Among the contributors, Meg Tong and Evan Hubinger are listed as being from Anthropic, and Alexander Matt Turner as being from the Center for Human-Compatible AI.
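
The summary does not say how answers to the multiple-choice behavioral questions are scored. One plausible way to do it, shown here purely as an assumption, is to compare the model's logits for the two answer options and report the average probability assigned to the option that matches the target behavior. The helper names `behavior_prob` and `mean_behavior_score` are hypothetical, and the logits below are toy numbers rather than model outputs.

```python
import math


def behavior_prob(logit_match, logit_other):
    """Two-way softmax probability of the behavior-matching answer option."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_match, logit_other)
    e_match = math.exp(logit_match - m)
    e_other = math.exp(logit_other - m)
    return e_match / (e_match + e_other)


def mean_behavior_score(logit_pairs):
    """Average behavior-matching probability over a set of questions.

    logit_pairs: iterable of (logit_match, logit_other) tuples, one per
    multiple-choice question (assumed format, for illustration only).
    """
    pairs = list(logit_pairs)
    if not pairs:
        raise ValueError("need at least one question")
    return sum(behavior_prob(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy logits for three questions under two hypothetical steering settings.
    unsteered = [(0.2, 0.1), (-0.3, 0.4), (1.0, 0.9)]
    steered = [(1.2, -0.5), (0.6, -0.2), (2.0, 0.1)]
    print(mean_behavior_score(unsteered), mean_behavior_score(steered))
```

Under this assumed metric, sweeping the steering coefficient and recomputing the score for each value would show how strongly the coefficient moves the model toward or away from the behavior.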
Created on 28 Jan. 2026
