AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

AI-generated keywords: Natural Language Processing

AI-generated Key Points

  • Fine-grained steering of language model outputs is crucial for safety and reliability in natural language processing.
  • Common techniques used for steering include prompting and finetuning, as well as representation-based methods such as sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning.
  • No standardized benchmark existed for comparing these approaches directly, which motivated AxBench, a large-scale benchmark for steering and concept detection.
  • Prompting outperformed all existing methods in steering performance, followed by finetuning. However, representation-based methods like difference-in-means showed superior performance in concept detection tasks.
  • Sparse autoencoders (SAEs) were not competitive in either evaluation.
  • A novel weakly-supervised representational method called Rank-1 Representation Finetuning (ReFT-r1) was introduced and proved to be competitive on both steering and concept detection tasks while offering interpretability advantages over prompting.
  • The researchers trained and publicly released SAE-scale feature dictionaries for ReFT-r1 and DiffMean.

Authors: Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts

License: CC BY 4.0

Abstract: Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.

Submitted to arXiv on 28 Jan. 2025

Results of the summarizing process for the arXiv paper: 2501.17148v2

Fine-grained steering of language model outputs is crucial for safety and reliability in natural language processing. Prompting and finetuning are the most common ways to achieve it, but researchers have also explored representation-based methods such as sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. Until now, there has been no standardized benchmark for comparing these approaches directly.

To address this gap, the authors introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on the Gemma-2-2B and 9B models. For steering, prompting outperformed all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means (DiffMean) performed best. SAEs were not competitive in either evaluation.

Beyond evaluating existing methods, the authors introduce a novel weakly-supervised representational method, Rank-1 Representation Finetuning (ReFT-r1). It is competitive on both steering and concept detection while offering the interpretability advantages that prompting lacks. Along with AxBench, the authors trained and publicly released SAE-scale feature dictionaries for ReFT-r1 and DiffMean.

Overall, the study gives a direct comparison of steering techniques on a common benchmark and adds ReFT-r1 to the set of interpretability-focused steering methods.
Created on 04 Feb. 2025
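
The summary names difference-in-means as a strong concept detector but does not describe how it works. The following is a minimal sketch of the general technique, not the AxBench implementation: the function names (`diff_mean_vector`, `detect`, `steer`), the three-dimensional toy activations, and the steering strength are all hypothetical and chosen only for illustration. The idea is to take the mean activation over examples that contain a concept, subtract the mean over examples that do not, and use the resulting direction both to score new activations (detection) and to shift them (steering).

```python
from statistics import fmean


def diff_mean_vector(positive, negative):
    """Return mean(positive) - mean(negative), computed per dimension.

    `positive` and `negative` are lists of equal-length activation vectors
    (hypothetical stand-ins for hidden states of a language model).
    """
    dims = len(positive[0])
    pos_mean = [fmean(row[d] for row in positive) for d in range(dims)]
    neg_mean = [fmean(row[d] for row in negative) for d in range(dims)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def detect(activation, direction):
    """Concept score: dot product of an activation with the direction."""
    return sum(a * w for a, w in zip(activation, direction))


def steer(activation, direction, strength):
    """Shift an activation along the direction by `strength` (assumed additive steering)."""
    return [a + strength * w for a, w in zip(activation, direction)]


# Toy data: the concept is "high in dimension 0, low in dimension 1".
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = diff_mean_vector(positive, negative)
score_with_concept = detect([1.0, 0.0, 2.0], direction)
score_without_concept = detect([0.0, 1.0, 2.0], direction)
steered = steer([0.0, 1.0, 2.0], direction, 0.5)

print(direction, score_with_concept, score_without_concept, steered)
```

In this toy setup the activation containing the concept scores higher than the one without it, and steering a negative example along the direction moves it toward the positive examples. Real use would replace the toy lists with model hidden states and tune the steering strength empirically.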
