Predicting the Performance of Black-box LLMs through Self-Queries

AI-generated keywords: AI systems, large language models, black-box access, predicting model behavior, self-query methods

AI-generated Key Points

  • The reliance on large language models (LLMs) in AI systems is increasing.
  • Predicting when LLMs may make mistakes is crucial.
  • Internal representations of LLMs are inaccessible when a model is available only as a black box through an API.
  • Features of an LLM can be extracted in a black-box manner by asking follow-up prompts and using the probabilities of different responses as a representation.
  • Training a linear model on these low-dimensional representations yields reliable and generalizable predictors of model performance at the instance level.
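  • These predictors can determine whether a specific generation correctly answers a question, and they often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary.
  • Extracted features can help evaluate nuanced aspects of an LLM's state, such as distinguishing a clean model from one influenced by an adversarial system prompt that makes it answer questions incorrectly or introduce bugs into generated code.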
  • The features are effective in distinguishing between different model architectures and sizes, enabling detection of misrepresented models provided through an API.
  • By predicting model behavior through self-queries and a small amount of labeled data, this approach can sometimes match the performance achieved by examining activations directly.

Authors: Dylan Sam, Marc Finzi, J. Zico Kolter

28 pages
License: CC BY 4.0

Abstract: As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. While a great deal of work in the field uses internal representations to interpret model behavior, these representations are inaccessible when given solely black-box access through an API. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance at the instance level (e.g., if a particular generation correctly answers a question). Remarkably, these can often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted features can be used to evaluate more nuanced aspects of a language model's state. For instance, they can be used to distinguish between a clean version of GPT-4o-mini and a version that has been influenced via an adversarial system prompt that answers question-answering tasks incorrectly or introduces bugs into generated code. Furthermore, they can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4o-mini).
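
The pipeline described in the abstract can be illustrated with a minimal sketch, under loud assumptions: the follow-up prompts, the `simulated_yes_probs` stand-in, and all numeric values below are hypothetical and synthetic, not taken from the paper, and no real API is called. The sketch only shows the shape of the idea: one probability per follow-up prompt forms a low-dimensional feature vector, and a linear (logistic) model is trained on those vectors to predict whether a generation was correct.

```python
import numpy as np

# Hypothetical follow-up prompts; the paper's actual prompts are not reproduced here.
FOLLOW_UPS = [
    "Are you confident in your answer?",
    "Could your answer be wrong?",
    "Would an expert agree with your answer?",
]

def simulated_yes_probs(is_correct, rng):
    """Stand-in for a black-box API call returning P("yes") per follow-up prompt.

    The means below are synthetic assumptions chosen so that correct and
    incorrect generations differ slightly; a real system would read these
    probabilities from the model's response distribution.
    """
    base = np.array([0.70, 0.30, 0.65]) if is_correct else np.array([0.50, 0.50, 0.45])
    return np.clip(base + rng.normal(0.0, 0.1, size=base.shape), 0.0, 1.0)

def add_bias(X):
    """Append a constant column so the linear model has an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])

def fit_logistic(X, y, lr=1.0, steps=2000):
    """Fit logistic regression with plain batch gradient descent."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Predicted probability that each generation is correct."""
    return 1.0 / (1.0 + np.exp(-add_bias(X) @ w))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)  # 1 = generation was correct (synthetic)
features = np.stack([simulated_yes_probs(bool(c), rng) for c in labels])

train, test = slice(0, 300), slice(300, 400)
weights = fit_logistic(features[train], labels[train])
accuracy = np.mean((predict_proba(weights, features[test]) > 0.5) == labels[test])
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Because the data are simulated, the printed accuracy says nothing about real LLMs; the point is that the predictor sees only a handful of response probabilities and never the model's internals.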

Submitted to arXiv on 02 Jan. 2025

Results of the summarizing process for the arXiv paper: 2501.01558v1

Large language models (LLMs) are increasingly relied on in AI systems, so predicting when they make mistakes matters. Prior work interprets model behavior through internal representations, but these are inaccessible when a model is available only as a black box through an API. This paper extracts features of an LLM in a black-box manner: it asks follow-up prompts and uses the probabilities of different responses as a representation on which to train predictors of model behavior.

The study shows that a linear model trained on these low-dimensional representations yields reliable and generalizable predictors of performance at the instance level, for example whether a particular generation correctly answers a question. These predictors often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary.

The extracted features can also evaluate more nuanced aspects of a language model's state. They can distinguish a clean version of an LLM from one influenced by an adversarial system prompt that makes it answer question-answering tasks incorrectly or introduce bugs into generated code. They also reliably distinguish between different model architectures and sizes, which enables detecting misrepresented models provided through an API. By predicting model behavior and performance through self-queries and a small amount of labeled data from downstream tasks, the approach can sometimes match the performance achieved by examining activations directly.
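
The model-identification use case can be sketched in the same spirit. Everything here is an assumption for illustration: the model names, the per-model feature profiles, and the nearest-centroid rule are hypothetical and synthetic, and nearest-centroid is a swapped-in simplification rather than the linear predictor the paper describes. The sketch shows how self-query features from an endpoint might be compared against reference features from known models.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mean P("yes") on three follow-up prompts for two models.
# These values are synthetic assumptions, not measurements.
PROFILES = {
    "model_a": np.array([0.80, 0.20, 0.70]),
    "model_b": np.array([0.60, 0.40, 0.55]),
}

def sample_features(model_name, n):
    """Simulate n self-query feature vectors from the named model."""
    noise = rng.normal(0.0, 0.1, size=(n, 3))
    return np.clip(PROFILES[model_name] + noise, 0.0, 1.0)

# Reference features collected from endpoints known to serve each model.
centroids = {name: sample_features(name, 200).mean(axis=0) for name in PROFILES}

def identify(batch):
    """Return the model whose reference centroid is closest to the batch mean."""
    mean = batch.mean(axis=0)
    return min(centroids, key=lambda name: np.linalg.norm(mean - centroids[name]))

# An endpoint advertised as model_a that actually serves model_b (simulated).
served = sample_features("model_b", 50)
detected = identify(served)
print("advertised: model_a, detected:", detected)
```

Averaging over a batch of queries reduces the noise in any single feature vector, which is why the comparison is made on the batch mean rather than on individual responses.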
Created on 26 Jul. 2025
