A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
AI-generated Key Points
- Comprehensive framework for evaluating Interactive Large Language Models (LLMs) like ChatGPT
- Performance assessment across 23 publicly available datasets covering 8 common NLP tasks
- Introduction of a new multimodal dataset for evaluation
- ChatGPT outperforms other LLMs in zero-shot settings on most tasks and even outperforms fine-tuned models on some tasks
- Better at understanding non-Latin script languages than generating them; like other LLMs it hallucinates, producing more extrinsic hallucinations from its parametric memory because it has no access to an external knowledge base
- Evaluation of reasoning across 10 categories spanning logical, non-textual, and commonsense reasoning
- Unreliable reasoner (63.41% average accuracy); better at deductive than inductive reasoning
- Interactivity enables human collaboration through multi-turn prompt engineering, improving summarization by 8% ROUGE-1 and machine translation by 2% ChrF++
- Details on various NLP tasks such as summarization, machine translation (MT), sentiment analysis, question answering, misinformation detection, task-oriented dialogue systems, and open-domain dialogue systems
- Emphasis on ethical considerations in generative AI model development: fairness, toxicity, demographic bias, safety
- Acknowledgment of funding sources including government grants in Hong Kong
- Compliance with data usage licenses and terms outlined by OpenAI and other relevant entities involved
Authors: Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung
Abstract: This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release the codebase for evaluation set extraction.
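To make the summarization figure above more concrete, here is a minimal sketch of how a ROUGE-1 F1 score can be computed and compared across two turns of a conversation. It assumes whitespace tokenization and lowercasing, which may differ from the scoring tool the authors used; the function name `rouge1_f1` and the example sentences are hypothetical and are not taken from the paper or its codebase.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Assumption: lowercased whitespace tokens; real ROUGE tools may also
    apply stemming or a different tokenizer.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a unigram counts only as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: a first-turn summary and a revised second-turn summary.
reference = "the cat sat on the mat"
first_turn = "a cat was on a mat"
second_turn = "the cat sat on a mat"

gain = rouge1_f1(reference, second_turn) - rouge1_f1(reference, first_turn)
print(f"ROUGE-1 F1 gain after the follow-up prompt: {gain:.4f}")
```

A reported multi-turn improvement would then be the difference between such scores, averaged over a test set, before and after the follow-up prompt. Whether the 8% in the abstract is an absolute or relative difference is not stated in this summary.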