Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task

AI-generated keywords: Eval4NLP, Quality Estimation, Prompt-based Techniques, LLMs, Codalab

AI-generated Key Points

  • Participation in the 2023 Eval4NLP shared task on evaluating prompt-based techniques for enhancing Large Language Models (LLMs) in quality estimation tasks
  • Systematic experiments conducted using various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting
  • Integration of prompting approaches with zero-shot and one-shot learning methods to optimize evaluation procedures
  • Development of a set of instructions to evaluate the quality of news article summaries based on coherence, consistency, fluency, and relevance
  • Refinement of prompts through manual prompt rewriting, instruction enhancement via LLMs, and prompt refinement through paraphrasing
  • Experimentation with prompts instructing LLMs to output both scores and explanations, which yielded poor results
  • Utilization of Codalab as the platform for submitting system entries for evaluation purposes
  • Use of the Kendall rank correlation coefficient as the evaluation metric, due to its suitability for data that does not meet specific assumptions or for small sample sizes
  • Employment of Core Prompts, prompt refinement, and further prompt refinement strategies to enhance prompt effectiveness and interpretability
  • Combination of approaches using a "small" open-source model (orca_mini_v3_7B) resulting in competitive results

Authors: Neema Kotonya, Saran Krishnasamy, Joel Tetreault, Alejandro Jaimes

Eval4NLP 2023 Shared Task
License: CC BY 4.0

Abstract: This paper describes and analyzes our participation in the 2023 Eval4NLP shared task, which focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation, particularly in the context of evaluating machine translations and summaries. We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting. In addition, we integrated these approaches with zero-shot and one-shot learning methods to maximize the efficacy of our evaluation procedures. Our work reveals that combining these approaches using a "small", open source model (orca_mini_v3_7B) yields competitive results.

Submitted to arXiv on 01 Nov. 2023

Results of the summarizing process for the arXiv paper: 2311.00686v1

This paper describes our participation in the 2023 Eval4NLP shared task, which focuses on evaluating prompt-based techniques to enhance the performance of Large Language Models (LLMs) in quality estimation tasks, specifically for machine translations and summaries. We conducted systematic experiments using various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting. Additionally, we integrated these approaches with zero-shot and one-shot learning methods to optimize our evaluation procedures.
We developed a set of instructions for evaluating the quality of news article summaries on four aspects (coherence, consistency, fluency, and relevance), assessing each sentence in the summary with respect to the article. Each aspect was scored on a scale from 1 (worst) to 5 (best). We refined these prompts using three key strategies: manual prompt rewriting, instruction enhancement via LLMs, and prompt refinement through paraphrasing. In manual prompt rewriting, we rewrote the instructions to elicit fine-grained answers and employed templates specifying the desired answer format. We also experimented with prompts that instructed the LLM to output both scores and explanations; however, prompting for explanations alongside quality estimation yielded poor results. In the instruction enhancement via LLMs strategy, we provided a seed prompt as context and prompted a separate LLM to enhance the existing instructions, using phrases such as "Improve the following instructions" or "Rewrite the following instructions to yield better responses".
For evaluation, we used Codalab as the platform for submitting our system entries. The organizers of the shared task provided direct assessment baselines for LLMs as reference points for evaluating system performance.
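
As a rough illustration of the prompting setup described above, the following Python sketch builds a zero-shot or one-shot scoring prompt with a fixed answer template, builds an instruction-enhancement prompt from a seed prompt, and parses a 1 to 5 score from a model response. The function names, the exact prompt wording, and the `Score: <number>` answer format are assumptions made for illustration; they are not the prompts used in the paper, and no model is actually called.

```python
import re

# The four aspects named in the summary; everything else here is hypothetical.
ASPECTS = ("coherence", "consistency", "fluency", "relevance")


def build_prompt(article, summary, aspect, example=None):
    """Build a zero-shot prompt, or a one-shot prompt when `example` is given.

    `example` is an (article, summary, score) triple used as the single
    demonstration. The wording is an illustrative assumption.
    """
    if aspect not in ASPECTS:
        raise ValueError(f"unknown aspect: {aspect}")
    parts = [
        f"Rate the {aspect} of the summary with respect to the article "
        "on a scale from 1 (worst) to 5 (best).",
        "Answer in the format 'Score: <number>'.",
    ]
    if example is not None:
        ex_article, ex_summary, ex_score = example
        parts.append(
            f"Article: {ex_article}\nSummary: {ex_summary}\nScore: {ex_score}"
        )
    parts.append(f"Article: {article}\nSummary: {summary}\nScore:")
    return "\n\n".join(parts)


def build_enhancement_prompt(seed_prompt,
                             phrase="Improve the following instructions"):
    """Wrap a seed prompt so a separate LLM can be asked to rewrite it."""
    return f"{phrase}:\n\n{seed_prompt}"


def parse_score(response):
    """Return the first standalone digit from 1 to 5 in the response, or None."""
    match = re.search(r"\b([1-5])\b", response)
    return int(match.group(1)) if match else None
```

Constraining the answer to a short template keeps parsing simple, which is one plausible reason to prefer score-only prompts over prompts that also request explanations.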
The Kendall rank correlation coefficient was used as the evaluation metric because, as a rank-based measure, it is suitable when data does not meet specific assumptions or when sample sizes are small. We employed three main classes of strategies to enhance prompt effectiveness and interpretability: Core Prompts, prompt refinement, and further prompt refinement. Core Prompts involved one-step methods for generating prompts, while prompt refinement focused on manual and automatic methods to refine them. Finally, further prompt refinement outlined two simple approaches to improve generated prompts. Overall, our experiments showed that combining these approaches using a "small" open-source model (orca_mini_v3_7B) resulted in competitive results. We believe that our findings contribute to the field of quality estimation and demonstrate the effectiveness of prompt-based techniques in empowering LLMs for handling such tasks.
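
To make the evaluation metric concrete, here is a minimal pure-Python sketch of the Kendall rank correlation between model scores and human scores. It computes the tau-b variant, which corrects for ties; the summary does not say which variant the shared task used, so that choice is an assumption, and the official scoring script may differ.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length score lists (O(n^2) pairs)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counted in neither term
            elif dx == 0:
                ties_x += 1  # tied only in x
            elif dy == 0:
                ties_y += 1  # tied only in y
            elif dx * dy > 0:
                concordant += 1  # pair ordered the same way in both lists
            else:
                discordant += 1  # pair ordered in opposite ways
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    # Undefined when one list is constant; return NaN in that case.
    return (concordant - discordant) / denom if denom else float("nan")
```

For example, `kendall_tau_b([1, 2, 3], [1, 3, 2])` has two concordant pairs and one discordant pair, giving (2 - 1) / 3. Because only the relative ordering of scores matters, a metric that ranks summaries the same way as human annotators scores well even if its absolute 1 to 5 values differ.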
Created on 26 Dec. 2023
