Humans or LLMs as the Judge? A Study on Judgement Biases

AI-generated keywords: Study, Framework, Biases, LLM Judges, Human Evaluators

AI-generated Key Points

  • The study developed a novel framework to investigate biases in human and Large Language Model (LLM) judges.
  • The study focuses on five specific types of bias and finds significant inclinations in both human and LLM judges.
  • These weaknesses can be exploited to attack LLM judges.
  • The authors aim to shed light on the vulnerabilities and biases of both human and LLM judges when evaluating performance, in support of a deeper understanding of these issues.
  • The authors emphasize the need for robust evaluation systems and encourage expanded evaluation efforts using an open-sourced dataset.

Authors: Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang

18 pages
License: CC BY 4.0

Abstract: Adopting human and large language models (LLM) as judges (a.k.a. human- and LLM-as-a-judge) for evaluating the performance of existing LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLM judges, questioning the reliability of the evaluation results. In this paper, we propose a novel framework for investigating 5 types of biases for LLM and human judges. We curate a dataset with 142 samples referring to the revised Bloom's Taxonomy and conduct thousands of human and LLM evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the most cutting-edge judges possess considerable biases. We further exploit their weakness and conduct attacks on LLM judges. We hope that our work can notify the community of the vulnerability of human- and LLM-as-a-judge against perturbations, as well as the urgency of developing robust evaluation systems.

Submitted to arXiv on 16 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.10669v1

The authors of this study developed a novel framework to investigate biases in human and Large Language Model (LLM) judges. They focused on five specific types of bias and found significant inclinations in both human and LLM judges. The study also showed that these weaknesses can be exploited to attack LLM judges.

The study involved 60 human evaluators from various countries, selected on criteria related to English proficiency, logic skills, and educational background. The evaluators were asked to judge the quality of answers to questions. The control group performed 2162 evaluation tasks and the experiment group performed 2367 tasks. Detailed instructions were provided to ensure a clear understanding of the evaluation criteria.

Evaluators were instructed to focus solely on the semantic quality of answers and to disregard non-semantic factors such as tone or format. They were also given guidelines on how to handle cases where both responses seemed equally good or equally bad.

Overall, the study highlights the presence of biases in both human and LLM judges when evaluating performance. By shedding light on these vulnerabilities, the authors hope to contribute to a deeper understanding of bias in both kinds of judge. They emphasize the need for robust evaluation systems and encourage expanded evaluation efforts that build on an open-sourced dataset.
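
As a rough illustration of the control-versus-experiment design described above, the sketch below compares how often judges pick a given answer in each group. It is a minimal example under stated assumptions, not the paper's own metric or code: the verdict labels ("A", "B", "tie"), the function names `preference_rates` and `preference_shift`, and the toy votes are all hypothetical.

```python
from collections import Counter

LABELS = ("A", "B", "tie")  # assumed verdict labels; "tie" covers equally good or bad


def preference_rates(votes):
    """Return the share of each verdict in a list of votes."""
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes)
    return {label: counts[label] / len(votes) for label in LABELS}


def preference_shift(control_votes, experiment_votes, target="B"):
    """Change in the share of votes for `target` from control to experiment.

    Assumes `target` is the answer that was perturbed in the experiment group.
    """
    control = preference_rates(control_votes)
    experiment = preference_rates(experiment_votes)
    return experiment[target] - control[target]


# Toy votes invented for illustration only; they are not data from the study.
control_votes = ["A", "A", "B", "tie", "A", "B", "A", "tie"]
experiment_votes = ["B", "A", "B", "B", "tie", "B", "A", "B"]

shift = preference_shift(control_votes, experiment_votes)
print(f"shift toward answer B: {shift:+.3f}")
```

A shift well above zero would suggest that the perturbation, rather than semantic quality, is moving verdicts. A real analysis would also need matched question sets and a significance test.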
Created on 20 Feb. 2024
