Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

AI-generated keywords: Instruction-following models

AI-generated Key Points

  • Authors evaluate performance of instruction-following models in retrieval-augmented settings for QA tasks
  • Models include Llama-2, GPT-3.5, Flan-T5, and Alpaca
  • Investigation done across three diverse QA tasks
  • Automatic and human evaluation used to assess correctness and faithfulness of models
  • Traditional QA metrics like exact match (EM) and F1 found unreliable because model responses are more verbose than the short reference answers
  • Proposed simple token-overlap based and model-based metrics to address this issue
  • Instruction-following models shown to be competitive and sometimes outperform fine-tuned models in terms of correctness
  • However, these models struggle with adherence to provided knowledge and generate responses with hallucinations
  • Authors encourage holistic evaluation of instruction-following models for QA tasks
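To make the point about verbosity concrete, here is a minimal sketch of how exact match and F1 penalize a correct but wordy answer, while a recall-style token-overlap score does not. The normalization steps and the function names (`normalize`, `exact_match`, `f1`, `recall`) are assumptions chosen for illustration; this page does not specify the authors' exact metric definitions, so the authors' repository remains the place to check them.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def overlap(response: str, reference: str) -> int:
    """Number of tokens shared by response and reference (multiset intersection)."""
    common = Counter(normalize(response)) & Counter(normalize(reference))
    return sum(common.values())


def exact_match(response: str, reference: str) -> float:
    """1.0 only if the normalized strings are identical."""
    return float(normalize(response) == normalize(reference))


def f1(response: str, reference: str) -> float:
    """Harmonic mean of token precision and token recall."""
    resp, ref = normalize(response), normalize(reference)
    shared = overlap(response, reference)
    if shared == 0:
        return 0.0
    precision = shared / len(resp)
    rec = shared / len(ref)
    return 2 * precision * rec / (precision + rec)


def recall(response: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the response."""
    ref = normalize(reference)
    if not ref:
        return 0.0
    return overlap(response, reference) / len(ref)


reference = "Paris"
verbose = "The capital of France is Paris."
print(exact_match(verbose, reference), f1(verbose, reference), recall(verbose, reference))
```

In this toy case the verbose answer is correct, yet exact match is 0 and F1 is about 0.33, because four of the five response tokens are not in the reference. Recall is 1.0, since every reference token is present.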

Authors: Vaibhav Adlakha, Parishad BehnamGhader, Xing Han Lu, Nicholas Meade, Siva Reddy

License: CC BY 4.0

Abstract: Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents to their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to evaluate these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive, and sometimes even outperform fine-tuned models for correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data are available at https://github.com/McGill-NLP/instruct-qa
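The retrieval-augmented setup the abstract describes, prepending retrieved documents and an instruction to the model input, can be sketched as below. The template layout, the `build_prompt` name, and the example instruction are all hypothetical and are not taken from the authors' code; the actual prompts may differ per model and task.

```python
def build_prompt(instruction: str, passages: list[str], question: str) -> str:
    """Assemble one model input: instruction, retrieved passages, then the question.

    The section labels ("Knowledge:", "Question:", "Answer:") are illustrative
    choices, not the authors' template.
    """
    knowledge = "\n".join(f"- {p}" for p in passages)
    return (
        f"{instruction}\n\n"
        f"Knowledge:\n{knowledge}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )


prompt = build_prompt(
    instruction="Answer the question using only the knowledge provided.",
    passages=["Paris is the capital of France.", "Lyon is a city in France."],
    question="What is the capital of France?",
)
print(prompt)
```

Because the knowledge enters only through the prompt, switching domains means switching the retrieved passages, with no change to model weights.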

Submitted to arXiv on 31 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.16877v1

In this work, the authors evaluate the performance of instruction-following models in retrieval-augmented settings for question answering (QA) tasks. These models, such as Llama-2, GPT-3.5, Flan-T5, and Alpaca, are attractive alternatives to fine-tuned approaches as they can be adapted to various information domains without additional fine-tuning. The authors investigate these models across three diverse QA tasks and use both automatic and human evaluation to assess their correctness (how well they satisfy the user's information need) and faithfulness (whether they produce a response based on the provided knowledge). They highlight the shortcomings of traditional QA metrics like exact match (EM) and F1, which become unreliable because the models' natural, fluent responses are more verbose than the reference answers. To address this issue, the authors propose simple token-overlap based and model-based metrics that better reflect the true performance of these models. Their analysis reveals that instruction-following models are competitive and sometimes even outperform fine-tuned models in terms of correctness. However, these models struggle to adhere to the provided knowledge and often generate responses that contain hallucinations. The authors hope that their work encourages a more holistic evaluation of instruction-following models for QA tasks.
Created on 31 Jan. 2024
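The faithfulness dimension in the summary above can also be approximated with token overlap. The sketch below scores the share of response tokens that occur in the provided knowledge. The name `knowledge_precision` and the tokenization are assumptions for illustration, and this page does not confirm that this is the authors' exact faithfulness metric. Token overlap is only a proxy: a response can reuse knowledge tokens and still state something the knowledge does not support.

```python
import string
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase, remove punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def knowledge_precision(response: str, knowledge: str) -> float:
    """Fraction of response tokens that also occur in the knowledge passage."""
    resp = tokens(response)
    if not resp:
        return 0.0
    common = Counter(resp) & Counter(tokens(knowledge))
    return sum(common.values()) / len(resp)


knowledge = "Paris is the capital of France."
print(knowledge_precision("The capital is Paris", knowledge))       # grounded response
print(knowledge_precision("Lyon hosts the parliament", knowledge))  # mostly unsupported
```

The first response draws every token from the passage and scores 1.0; the second shares only "the" with it and scores 0.25, flagging content that did not come from the provided knowledge.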
