Predicting the Performance of Black-box LLMs through Self-Queries
AI-generated Key Points
- The reliance on large language models (LLMs) in AI systems is increasing.
- Predicting when LLMs may make mistakes is crucial.
- Internal representations of LLMs are inaccessible when a model is available only as a black box through an API.
- The paper extracts features in a black-box manner by asking the model follow-up prompts and using the probabilities of its different responses as a representation.
- Training a linear model on these low-dimensional representations yields reliable and generalizable predictors of model performance at the instance level.
- These predictors can determine whether a specific generation correctly answers a question, and they often outperform white-box linear predictors that operate over the model's hidden state or its full vocabulary distribution.
- The extracted features can also evaluate more nuanced aspects of a model's state, such as distinguishing a clean version of GPT-4o-mini from one steered by an adversarial system prompt to answer questions incorrectly or to introduce bugs into generated code.
- The features reliably distinguish between different model architectures and sizes, enabling detection of misrepresented models provided through an API (e.g., GPT-3.5 supplied instead of GPT-4o-mini).
- By predicting model behavior through self-queries and a small amount of labeled data, the approach performs comparably to, and often better than, linear predictors that examine activations directly.
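The pipeline described in the key points can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the follow-up prompts in `FOLLOW_UPS`, the `yes_prob` callable (a stand-in for an API call that returns the probability of a "yes" reply), and the training hyperparameters are all hypothetical. The linear model here is a plain logistic regression fitted by gradient descent.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical follow-up prompts; the paper's actual prompts are not listed here.
FOLLOW_UPS = [
    "Are you confident in your answer?",
    "Could your answer be wrong?",
    "Would an expert agree with your answer?",
]

def extract_features(yes_prob: Callable[[str, str, str], float],
                     question: str, answer: str,
                     follow_ups: Sequence[str] = FOLLOW_UPS) -> List[float]:
    """One feature per follow-up prompt: the probability of a 'yes' reply.

    `yes_prob` is an assumed wrapper around a black-box API that exposes
    token probabilities; it is not a real library function.
    """
    return [yes_prob(question, answer, f) for f in follow_ups]

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_linear_probe(X: List[List[float]], y: List[int],
                     lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Fit logistic regression with batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_correct(w: List[float], b: float, x: List[float]) -> float:
    """Estimated probability that the generation behind features x is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

In use, each labeled (question, answer) pair would be turned into a feature vector with `extract_features`, the probe fitted on those vectors and their correctness labels, and `predict_correct` applied to new generations. The feature vector has only as many dimensions as there are follow-up prompts, which is what makes the representation low-dimensional.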
Authors: Dylan Sam, Marc Finzi, J. Zico Kolter
Abstract: As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. While a great deal of work in the field uses internal representations to interpret model behavior, these representations are inaccessible when given solely black-box access through an API. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance at the instance level (e.g., if a particular generation correctly answers a question). Remarkably, these can often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted features can be used to evaluate more nuanced aspects of a language model's state. For instance, they can be used to distinguish between a clean version of GPT-4o-mini and a version that has been influenced via an adversarial system prompt that answers question-answering tasks incorrectly or introduces bugs into generated code. Furthermore, they can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4o-mini).
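The model-identification use case mentioned at the end of the abstract can be illustrated with a small sketch. Note the assumptions: the abstract does not say which classifier is used for this task, so the nearest-centroid rule below is a swapped-in simplification, and the model names and feature values are made-up placeholders rather than measurements. Each feature vector is assumed to hold response probabilities to a fixed set of follow-up prompts.

```python
from typing import Dict, List

def centroid(rows: List[List[float]]) -> List[float]:
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_centroids(features_by_model: Dict[str, List[List[float]]]) -> Dict[str, List[float]]:
    """Compute one mean feature vector per known model."""
    return {name: centroid(rows) for name, rows in features_by_model.items()}

def identify_model(centroids: Dict[str, List[float]], x: List[float]) -> str:
    """Return the known model whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c: List[float]) -> float:
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(centroids, key=lambda name: sq_dist(centroids[name]))

def is_misrepresented(centroids: Dict[str, List[float]],
                      x: List[float], claimed: str) -> bool:
    """Flag an API response whose features match a different model than claimed."""
    return identify_model(centroids, x) != claimed

# Placeholder reference features for two hypothetical models.
reference = {
    "model-a": [[0.9, 0.1], [0.8, 0.2]],
    "model-b": [[0.4, 0.6], [0.5, 0.5]],
}
centroids = fit_centroids(reference)
```

With these placeholder values, a feature vector close to the `model-a` references that arrives from an endpoint advertised as `model-b` would be flagged by `is_misrepresented`. In practice a single response is noisy, so a decision would more plausibly aggregate over many queries.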