Sparse Autoencoders Can Interpret Randomly Initialized Transformers

AI-generated keywords: Sparse Autoencoders

AI-generated Key Points

  • Researchers explored the application of sparse autoencoders (SAEs) in interpreting randomly initialized transformers
  • Auto-interpretability involves using feature explanations to identify specific concepts within text for classification tasks
  • 'Fuzzing' scoring method and simulation scoring were used to evaluate feature explanations
  • Both random and trained transformers yield similarly interpretable SAE latents when analyzed using an open-source auto-interpretability pipeline
  • SAE quality metrics are comparable between random and trained transformers across different model sizes and layers
  • Auto-interpretability scores were notably consistent between trained and randomized transformers relative to control conditions, indicating the potential of SAEs for interpreting transformer models regardless of their initialization method

Authors: Thomas Heap, Tim Lawson, Lucy Farnik, Laurence Aitchison

License: CC BY 4.0

Abstract: Sparse autoencoders (SAEs) are an increasingly popular technique for interpreting the internal representations of transformers. In this paper, we apply SAEs to 'interpret' random transformers, i.e., transformers where the parameters are sampled IID from a Gaussian rather than trained on text data. We find that random and trained transformers produce similarly interpretable SAE latents, and we confirm this finding quantitatively using an open-source auto-interpretability pipeline. Further, we find that SAE quality metrics are broadly similar for random and trained transformers. We find that these results hold across model sizes and layers. We discuss a number of interesting questions that this work raises for the use of SAEs and auto-interpretability in the context of mechanistic interpretability.
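
To make the setup in the abstract concrete, here is a minimal sketch in plain Python. It samples a toy weight matrix IID from a Gaussian (a stand-in for one layer of a 'random transformer'), passes an activation through a generic ReLU sparse autoencoder, and reports two commonly used SAE quality measures. Everything named here is an assumption for illustration: the dimensions, the standard deviations, the helper names (gaussian_matrix, matvec, sae_forward), the ReLU architecture, and the choice of mean squared error and L0 as metrics are not taken from the paper, and the SAE weights are left untrained.

```python
import random


def gaussian_matrix(rows, cols, std, rng):
    """Return a rows x cols matrix whose entries are IID N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Generic SAE forward pass: ReLU encoder, then linear decoder."""
    latents = [max(0.0, p + b) for p, b in zip(matvec(w_enc, x), b_enc)]
    recon = [r + b for r, b in zip(matvec(w_dec, latents), b_dec)]
    return latents, recon


rng = random.Random(0)
d_model, d_sae = 8, 32  # hypothetical toy sizes

# Stand-in for a randomly initialized layer: parameters sampled IID
# from a Gaussian instead of being trained on text.
w_layer = gaussian_matrix(d_model, d_model, 0.5, rng)
token_embedding = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
activation = matvec(w_layer, token_embedding)

# Untrained SAE parameters, used only to show shapes and metrics.
w_enc = gaussian_matrix(d_sae, d_model, 0.1, rng)
w_dec = gaussian_matrix(d_model, d_sae, 0.1, rng)
b_enc = [0.0] * d_sae
b_dec = [0.0] * d_model

latents, recon = sae_forward(activation, w_enc, b_enc, w_dec, b_dec)

# Two illustrative quality measures (assumed, not the paper's list):
# reconstruction error and the number of active latents (L0).
mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / d_model
l0 = sum(1 for z in latents if z > 0.0)
print(f"reconstruction MSE: {mse:.4f}, active latents (L0): {l0}")
```

In a real experiment the SAE would be trained on activations collected from many tokens, and the same metrics would be compared between the random and the trained model.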

Submitted to arXiv on 29 Jan. 2025

Results of the summarizing process for the arXiv paper: 2501.17727v1

In the study "Sparse Autoencoders Can Interpret Randomly Initialized Transformers," researchers Thomas Heap, Tim Lawson, Lucy Farnik, and Laurence Aitchison explore the application of sparse autoencoders (SAEs) in interpreting randomly initialized transformers. The authors rely on auto-interpretability, which involves using feature explanations to identify specific concepts within text for classification tasks.

Two scoring methods were used to evaluate feature explanations. The 'fuzzing' method prompts a language model, given the explanation for a latent, to distinguish examples in which the delimited tokens have non-zero activation from examples in which they have zero activation. Simulation scoring is based on the correlation between simulated and observed activations.

For each model variant and layer, the researchers randomly sampled 100 features from the trained SAE and generated auto-interpretability scores using an implementation based on previous work by Paulo et al. The Meta-Llama-3.1-70B-Instruct-AWQ-INT4 model was used to generate explanations and predictions; it has a larger capacity than the models used in previous similar studies.

The researchers found that random and trained transformers yield similarly interpretable SAE latents when analyzed using an open-source auto-interpretability pipeline, and that SAE quality metrics are comparable between random and trained transformers across different model sizes and layers. Overall, auto-interpretability scores were notably consistent between trained and randomized transformer models compared to control conditions. These findings point to the potential of SAEs for interpreting transformer models regardless of their initialization method, and they raise questions about mechanistic interpretability in natural language processing applications.
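
The two scoring methods mentioned in the summary can be sketched roughly as follows. This is an illustrative sketch, not the pipeline used in the paper: the function names are hypothetical, the simulation score is taken to be a Pearson correlation, and the fuzzing score is assumed to be plain accuracy of the language model's yes/no judgments against the true labels (the summary does not say which statistic is reported). The judgments themselves are hard-coded here rather than produced by a model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def simulation_score(simulated, observed):
    """Correlation between simulated and observed latent activations."""
    return pearson(simulated, observed)


def fuzzing_score(judgments, labels):
    """Fraction of examples where the judge's yes/no answer matches
    whether the delimited tokens truly have non-zero activation.
    (Accuracy is an assumed choice of statistic.)"""
    correct = sum(1 for j, l in zip(judgments, labels) if j == l)
    return correct / len(labels)


# Hypothetical per-token activations for one latent.
observed = [0.0, 0.0, 1.5, 3.0]
simulated = [0.0, 0.0, 1.0, 2.0]  # what a model guessed from the explanation
sim = simulation_score(simulated, observed)

# Hypothetical fuzzing examples: True means "delimited tokens activate".
labels = [True, False, False, False]
judgments = [True, True, False, False]  # stand-in for model answers
fuzz = fuzzing_score(judgments, labels)

print(f"simulation score: {sim:.3f}, fuzzing score: {fuzz:.2f}")
```

Under either method, a good explanation is one that lets the judge model predict where the latent is active, so higher scores mean a more interpretable latent.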
Created on 03 Feb. 2025

