Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

AI-generated keywords: Study, EvalGen Interface, Large Language Models (LLMs), Participants, Iterative Nature

AI-generated Key Points

  • The study examined how people evaluate Large Language Model (LLM) outputs using the EvalGen interface
  • Participants evaluated named entity recognition (NER) outputs on a dataset of 100 tweets
  • The typical workflow involved grading LLM outputs and refining criteria based on EvalGen's suggestions
  • Participants found EvalGen a helpful starting point for generating assertions but wanted more control because the tool occasionally made mistakes
  • Participants were satisfied with EvalGen overall and stressed the importance of iterating on criteria and assertions to better align them with human requirements
  • Challenges highlighted: criteria definitions are subjective, and some criteria depend on the specific LLM outputs observed during the iterative process

Authors: Shreya Shankar, J. D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, Ian Arawjo

16 pages, 4 figures, 2 tables
License: CC BY 4.0

Abstract: Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to "validate the validators": aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear dependent on the specific LLM outputs observed (rather than independent criteria that can be defined a priori), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
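
The selection step described in the abstract (grade a subset of outputs, then keep the candidate implementation that best matches those grades) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the criterion ("response is concise"), the two candidate assertion functions, the sample grades, and the use of a plain agreement rate as the alignment score. It is not EvalGen's actual algorithm or metric.

```python
from typing import Callable, Dict, List

Assertion = Callable[[str], bool]

# Hypothetical candidate implementations of one criterion ("response is concise").
# Each returns True when the output passes the assertion.
def under_50_words(output: str) -> bool:
    return len(output.split()) < 50

def under_300_chars(output: str) -> bool:
    return len(output) < 300

def agreement(assertion: Assertion, graded: Dict[str, bool]) -> float:
    """Fraction of human-graded outputs where the assertion matches the human grade."""
    matches = sum(assertion(out) == grade for out, grade in graded.items())
    return matches / len(graded)

def select_best(candidates: List[Assertion], graded: Dict[str, bool]) -> Assertion:
    """Return the candidate with the highest agreement with the human grades."""
    return max(candidates, key=lambda fn: agreement(fn, graded))

# Made-up human grades on a small subset of outputs (True = good, False = bad).
graded = {
    "short answer": True,
    "word " * 80: False,   # 80 words, 400 characters
    "x" * 400: False,      # 1 "word", 400 characters
}

best = select_best([under_50_words, under_300_chars], graded)
print(best.__name__)
```

On this toy data the character-based check agrees with every human grade, while the word-based check misses the single long token, so the character-based check is selected. A real system would likely weigh missed bad outputs and wrongly failed good outputs separately instead of using one agreement number.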

Submitted to arXiv on 18 Apr. 2024

Results of the summarizing process for the arXiv paper: 2404.12272v1

The study conducted by Shreya Shankar, J. D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, and Ian Arawjo examined how people evaluate Large Language Model (LLM) outputs using the EvalGen interface. Participants evaluated named entity recognition (NER) outputs on a dataset of 100 tweets, following a typical workflow of grading LLM outputs and refining criteria based on EvalGen's suggestions. Participants found EvalGen a helpful starting point for generating assertions, but they wanted to retain control over the process because the tool occasionally made mistakes. Overall, participants were satisfied with EvalGen and stressed the importance of iterating on criteria and assertions to better align them with human requirements. The study highlighted two challenges in aligning LLM-generated evaluation functions with user grades: criteria definitions are subjective, and some criteria depend on the specific LLM outputs observed during the iterative process.
Created on 02 May. 2024
