This paper describes our participation in the 2023 Eval4NLP shared task, which focuses on evaluating prompt-based techniques to enhance the performance of Large Language Models (LLMs) in quality estimation tasks, specifically for machine translations and summaries. We conducted systematic experiments with several prompting techniques, including standard prompting, prompts informed by annotator instructions, and chain-of-thought prompting. We also combined these approaches with zero-shot and one-shot learning methods to optimize our evaluation procedures.
To evaluate the quality of news article summaries, we developed a set of instructions that assess coherence, consistency, fluency, and relevance for each sentence in the summary with respect to the article. Each aspect was scored on a scale from 1 (worst) to 5 (best). We refined these prompts using three strategies: manual prompt rewriting, instruction enhancement via LLMs, and prompt refinement through paraphrasing. In manual prompt rewriting, we rewrote the instructions to elicit fine-grained answers and used templates specifying the desired answer format. We also experimented with prompts that instructed the LLM to output both scores and explanations; however, prompting for explanations alongside quality estimation yielded poor results. In instruction enhancement via LLMs, we provided a seed prompt as context and prompted a separate LLM to enhance the existing instructions, using phrases such as "Improve the following instructions" or "Rewrite the following instructions to yield better responses".
We used Codalab as the platform for submitting our system entries. The organizers of the shared task provided direct assessment baselines for LLMs as reference points for evaluating system performance.
The Kendall rank coefficient was used as the evaluation metric because it is a non-parametric measure, which makes it suitable when the data does not meet the distributional assumptions of parametric measures or when sample sizes are small. We employed three main classes of strategies to enhance prompt effectiveness and interpretability: Core Prompts, prompt refinement, and further prompt refinement. Core Prompts involved one-step methods for generating prompts, while prompt refinement covered manual and automatic methods for refining them. Further prompt refinement consisted of two simple approaches to improving the generated prompts. Overall, our experiments showed that combining these approaches with a "small" open-source model (orca_mini_v3_7B) yielded competitive results. We believe that our findings contribute to the field of quality estimation and demonstrate the effectiveness of prompt-based techniques in enabling LLMs to handle such tasks.
- Participation in the 2023 Eval4NLP shared task on evaluating prompt-based techniques for enhancing Large Language Models (LLMs) in quality estimation tasks
- Systematic experiments conducted using various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting
- Integration of prompting approaches with zero-shot and one-shot learning methods to optimize evaluation procedures
- Development of a set of instructions to evaluate the quality of news article summaries based on coherence, consistency, fluency, and relevance
- Refinement of prompts through manual prompt rewriting, instruction enhancement via LLMs, and prompt refinement through paraphrasing
- Experimentation with prompts instructing LLMs to output both scores and explanations, which yielded poor results
- Use of Codalab as the platform for submitting system entries for evaluation
- Use of the Kendall rank coefficient as the evaluation metric due to its suitability for data that does not meet specific assumptions and for small sample sizes
- Use of Core Prompts, prompt refinement, and further prompt refinement strategies to enhance prompt effectiveness and interpretability
- Combination of approaches using a "small" open-source model (orca_mini_v3_7B), yielding competitive results
In 2023, there was a competition about using language models to judge how good translations and summaries are. People tried different ways to give the models instructions, like normal ones and new creative ones. They also used methods that work with just one example or without any examples at all. They made a set of rules to check if news summaries are good or not. They improved the instructions by rewriting them and using language models. They tried making the models explain their answers, but it didn't work well. They used a website called Codalab to submit their work for evaluation. They used a special way to measure how good the models are because it works well with small groups of data. They combined different methods using a small model and got good results.
Definitions
- Large Language Models (LLMs): Big computer programs that can understand and generate human-like text.
- Evaluation: Checking how good something is.
- Prompts: Instructions given to language models.
- Coherence: How well ideas fit together.
- Consistency: Whether the summary agrees with the facts in the article.
- Fluency: Speaking or writing smoothly.
- Relevance: Being related or important to something.
- Paraphrasing: Saying something in a different way but keeping the same meaning.
- Codalab: A website where people can share and evaluate their work.
- Kendall rank coefficient: A way to measure how closely two rankings of the same items agree.
- Core Prompts: Prompts that are generated in one step, before any refinement.
Evaluating Prompt-Based Techniques to Enhance Large Language Models for Quality Estimation Tasks
In the 2023 Eval4NLP shared task, researchers from various institutions conducted systematic experiments to evaluate prompt-based techniques that can enhance the performance of large language models (LLMs) in quality estimation tasks. This article will discuss the strategies used, results obtained, and implications of this research.
Background
The goal of this project was to evaluate LLMs in quality estimation tasks for machine translations and summaries. To do so, a set of instructions was developed to score news article summaries on four aspects: coherence, consistency, fluency, and relevance, assessed for each sentence in the summary with respect to the article. Each aspect was scored on a scale from 1 (worst) to 5 (best). The organizers provided direct assessment baselines for LLMs as reference points for evaluating system performance. The Kendall rank coefficient was chosen as the evaluation metric because it is a non-parametric measure, which makes it suitable when the data does not meet the distributional assumptions of parametric measures or when sample sizes are small.
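The Kendall rank coefficient mentioned above can be illustrated with a minimal pure-Python sketch. It assumes the tau-b variant, which corrects for ties; the source does not say which variant the shared task used, and the function name `kendall_tau_b` and the example scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall rank correlation, tau-b variant (assumed here; it handles ties)."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    # Compare every pair of items and classify the pair.
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: contributes to nothing
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both rankings order the pair the same way
        else:
            discordant += 1  # the rankings disagree on this pair
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up scores on the 1-5 scale, for illustration only.
human_scores = [1, 2, 3, 4, 5]
model_scores = [1, 3, 2, 5, 4]
print(kendall_tau_b(human_scores, model_scores))  # 0.6
```

In the example, 8 of the 10 pairs are ordered the same way by both score lists and 2 are swapped, so the coefficient is (8 - 2) / 10 = 0.6. Because only the ordering of pairs matters, the measure makes no assumption about how the scores are distributed.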
Experimental Strategies
Three main classes of strategies were employed during the experiments: Core Prompts, prompt refinement, and further prompt refinement. Core Prompts involved one-step methods for generating prompts, while prompt refinement covered manual and automatic methods for refining them. Further prompt refinement consisted of two simple approaches to improving the generated prompts.
For manual prompt rewriting, the instructions were rewritten to elicit fine-grained answers, and templates specifying the desired answer format were used. Prompting LLMs for explanations alongside quality estimation yielded poor results, so it was not pursued further in this research project. In the instruction enhancement via LLMs strategy, a seed prompt was provided as context and a separate LLM was prompted to enhance the existing instructions, using phrases such as "Improve the following instructions" or "Rewrite the following instructions to yield better responses".
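The two ideas above, an answer-format template and a seed prompt handed to a separate LLM for enhancement, can be sketched as follows. This is an illustration under stated assumptions, not the prompts used in the experiments: the template wording, the `Score: <number>` format, and the names `build_scoring_prompt`, `build_enhancement_prompt`, and `parse_score` are all hypothetical. Only the two enhancement phrases come from the text, and no model is actually called.

```python
import re

ASPECTS = ("coherence", "consistency", "fluency", "relevance")

# Hypothetical template: the wording is illustrative, not the authors' prompt.
SCORING_TEMPLATE = (
    "Rate the {aspect} of the summary with respect to the article "
    "on a scale from 1 (worst) to 5 (best).\n\n"
    "Article: {article}\n"
    "Summary: {summary}\n\n"
    "Answer in exactly this format: 'Score: <number>'"
)

# The two phrases quoted in the text.
ENHANCEMENT_PHRASES = (
    "Improve the following instructions",
    "Rewrite the following instructions to yield better responses",
)


def build_scoring_prompt(aspect, article, summary):
    """Fill the answer-format template for one quality aspect."""
    if aspect not in ASPECTS:
        raise ValueError(f"unknown aspect: {aspect!r}")
    return SCORING_TEMPLATE.format(aspect=aspect, article=article, summary=summary)


def build_enhancement_prompt(seed_prompt, phrase=ENHANCEMENT_PHRASES[0]):
    """Wrap a seed prompt in a request for a separate LLM to enhance it."""
    return f"{phrase}:\n\n{seed_prompt}"


def parse_score(response):
    """Return the 1-5 score found in a reply, or None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", response)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 1.0 <= score <= 5.0 else None


seed = build_scoring_prompt("fluency", "<article text>", "<summary text>")
meta_prompt = build_enhancement_prompt(seed)
print(meta_prompt.splitlines()[0])         # Improve the following instructions:
print(parse_score("Score: 4"))             # 4.0
print(parse_score("I cannot rate this."))  # None
```

Fixing the answer format in the template is what makes a simple parser like `parse_score` workable: replies that ignore the format or fall outside the 1 to 5 range are returned as `None` rather than guessed at.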
Results & Implications
Overall, our experiments showed that combining these approaches with a "small" open-source model (orca_mini_v3_7B) yielded competitive results compared with the direct assessment baselines provided by the organizers of the shared task. We believe that our findings contribute to understanding how best to use large language models in quality estimation tasks such as machine translation and summarization, through prompting techniques combined with zero-shot and one-shot learning methods. Our work also provides insights into how evaluation procedures can be optimized by refining existing instruction sets through manual rewriting, instruction enhancement via LLMs, and paraphrasing.