Sparse Autoencoders Trained on the Same Data Learn Different Features

AI-generated keywords: Mechanistic Interpretability

AI-generated Key Points

  • Sparse Autoencoders (SAEs) are used in Mechanistic Interpretability to reveal human-interpretable features in the activations of large language models (LLMs)
  • SAEs trained with different random seeds for weight initialization identify distinct feature sets
  • ReLU SAEs with L1 sparsity loss show greater stability across seeds, while TopK activation function SAEs are more seed-dependent
  • Features uncovered by SAEs should be viewed as a practical decomposition of activation space, not an exhaustive list of truly utilized features
  • Standard SAE designs may not adequately capture the hierarchical structure inherent in human concepts
  • It remains debated whether neural networks rely solely on linear representations or whether nonlinear features also play a crucial role
  • Study highlights limitations and complexities in interpreting feature extraction through sparse autoencoders in large language models

Authors: Gonçalo Paulo, Nora Belrose

License: CC BY 4.0

Abstract: Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features. For example, in an SAE with 131K latents trained on a feedforward network in Llama 3 8B, only 30% of the features were shared across different seeds. We observed this phenomenon across multiple layers of three different LLMs, two datasets, and several SAE architectures. While ReLU SAEs trained with the L1 sparsity loss showed greater stability across seeds, SAEs using the state-of-the-art TopK activation function were more seed-dependent, even when controlling for the level of sparsity. Our results suggest that the set of features uncovered by an SAE should be viewed as a pragmatically useful decomposition of activation space, rather than an exhaustive and universal list of features "truly used" by the model.
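
To make the notion of features "shared across seeds" concrete, here is a minimal sketch of one way such an overlap could be measured. It is an illustration only: the abstract does not specify the matching procedure, so the mutual-nearest-neighbour rule, the cosine-similarity threshold of 0.7, and the names `cosine`, `shared_fraction`, and `basis` are all assumptions made for this example. Each SAE is represented as a list of decoder direction vectors, one per latent.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def shared_fraction(dec_a, dec_b, threshold=0.7):
    """Fraction of latents in SAE A that have a mutual best match in SAE B.

    Assumption: a latent counts as "shared" when it and its partner are each
    other's nearest neighbour by cosine similarity and that similarity is at
    least `threshold`. Both the rule and the threshold are illustrative.
    """
    sims = [[cosine(u, v) for v in dec_b] for u in dec_a]
    # Best partner in B for every latent of A, and vice versa.
    best_b_for_a = [max(range(len(dec_b)), key=lambda j: row[j]) for row in sims]
    best_a_for_b = [
        max(range(len(dec_a)), key=lambda i: sims[i][j]) for j in range(len(dec_b))
    ]
    shared = sum(
        1
        for i, j in enumerate(best_b_for_a)
        if best_a_for_b[j] == i and sims[i][j] >= threshold
    )
    return shared / len(dec_a)


def basis(index, dim):
    """Hypothetical helper: a one-hot direction, standing in for a decoder row."""
    return [1.0 if k == index else 0.0 for k in range(dim)]


# Toy example: two "SAEs" with four latents each in a 6-dimensional space.
# They agree on two directions (in a different order) and differ on the rest.
dec_seed_0 = [basis(0, 6), basis(1, 6), basis(2, 6), basis(3, 6)]
dec_seed_1 = [basis(1, 6), basis(0, 6), basis(4, 6), basis(5, 6)]

print(shared_fraction(dec_seed_0, dec_seed_1))  # 0.5
```

In this toy case half of the latents are matched. For dictionaries with many thousands of latents, the all-pairs similarity computation would normally be done with a matrix library rather than nested Python lists.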

Submitted to arXiv on 28 Jan. 2025

Results of the summarizing process for the arXiv paper: 2501.16615v2

In the field of Mechanistic Interpretability, Sparse Autoencoders (SAEs) are widely used to uncover human-interpretable features in the activations of large language models (LLMs). However, this study shows that SAEs trained on the same model and data, differing only in the random seed used for weight initialization, identify distinct sets of features. The phenomenon was observed across multiple layers of three different LLMs, two datasets, and several SAE architectures. ReLU SAEs trained with the L1 sparsity loss were more stable across seeds, whereas SAEs using the TopK activation function were more seed-dependent, even when controlling for the level of sparsity. These findings suggest that the set of features uncovered by an SAE should be seen as a practical decomposition of activation space rather than an exhaustive list of features truly used by the model. There is also evidence that standard SAE designs may not adequately capture the hierarchical structure inherent in human concepts, and some studies have questioned whether neural networks rely solely on linear representations or whether nonlinear features also play a crucial role. Overall, the study highlights the limitations and complexities of interpreting the features extracted by sparse autoencoders in large language models, and it points to the value of considering alternative approaches to analyzing neural network activations.
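
The two sparsity mechanisms contrasted above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names, the example pre-activations, and the penalty coefficient are hypothetical, and the TopK variant here simply keeps the `k` largest pre-activations without any further clipping.

```python
def relu(pre_acts):
    """ReLU activation: every positive pre-activation passes through."""
    return [max(0.0, x) for x in pre_acts]


def l1_penalty(acts, coeff):
    """L1 sparsity loss term added to the reconstruction loss of a ReLU SAE."""
    return coeff * sum(abs(a) for a in acts)


def topk(pre_acts, k):
    """TopK activation: keep the k largest pre-activations, zero the rest.

    Sparsity is fixed by construction at k active latents per input, so no
    separate sparsity penalty is needed in this sketch.
    """
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(pre_acts)]


# Hypothetical encoder pre-activations for a single input.
pre = [0.5, -1.0, 2.0, 0.1, 1.5]

relu_acts = relu(pre)       # [0.5, 0.0, 2.0, 0.1, 1.5]
topk_acts = topk(pre, k=2)  # [0.0, 0.0, 2.0, 0.0, 1.5]

print(relu_acts, l1_penalty(relu_acts, coeff=0.1))
print(topk_acts)
```

With ReLU plus an L1 penalty, the number of active latents varies per input and is controlled indirectly through the coefficient; with TopK it is set directly by `k`.
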
Created on 03 Feb. 2025
