Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

AI-generated keywords: Data Contamination

AI-generated Key Points

  • Data contamination in NLP research using LLMs is a growing concern
  • Lack of access to model details and potential for indirect data leaking are major concerns
  • Authors conducted a systematic analysis of OpenAI's GPT-3.5 and GPT-4 models
  • No evidence was found that any of the reviewed papers opted out of providing data for model improvement
  • Only a minority of papers reported the model version used, even though different versions can yield different outputs
  • Evaluations of ChatGPT's performance were often unfair because comparisons with open-source LLMs or non-LLM-based methods were missing
  • Overall, the findings point to under-reported model versions and unfair evaluation practices across the reviewed papers
  • Results made available as a collaborative project for other researchers to contribute to addressing these issues.

Authors: Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, Ondřej Dušek

License: CC BY 4.0

Abstract: Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI's data usage policy, we extensively document the amount of data leaked to these models during the first year after the model's release. We report that these models have been globally exposed to ~4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.

Submitted to arXiv on 06 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.03927v1

Data contamination is a growing concern in Natural Language Processing (NLP) research that relies on Large Language Models (LLMs), particularly closed-source ones. Two problems stand out: researchers have no access to model details, including training data, and data can leak indirectly when models are iteratively improved using data from users. To address this, the authors conducted a systematic analysis of work using OpenAI's GPT-3.5 and GPT-4 and documented evaluation malpractices in the reviewed papers. Because OpenAI does not use content from its business offerings or API Platform to train its models, only interactions with the models through the web interface were considered in the data leakage analysis.

The authors queried multiple academic databases and examined 255 papers. They found no evidence that any of these papers opted out of providing data for model improvement. For each work they also tracked secondary information relevant to evaluation practice: peer-review status, availability of the prompts used in the experiments, availability of a repository for reproducing the experiments, whether the whole dataset or only a sample was used, whether results were compared with other open models or approaches on the same evaluation scale, and whether the version of GPT-3.5 or GPT-4 was reported.

Only a minority of papers reported the model version used, which matters because different versions can yield significantly different outputs. Evaluations of ChatGPT's performance were also often unfair because comparisons with open-source LLMs or non-LLM-based methods were missing. The authors have made their results available as a collaborative project to which other researchers can contribute.
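
The per-paper information described above could be recorded and aggregated along the following lines. This is a minimal sketch in Python, not the authors' actual tooling: the `PaperRecord` type, its field names, the `practice_rates` helper, and the two example papers are all hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PaperRecord:
    """One reviewed paper. Field names are illustrative, not the authors' schema."""
    title: str
    peer_reviewed: bool
    prompts_available: bool
    has_repository: bool
    used_whole_dataset: bool
    compared_with_open_models: bool
    reports_model_version: bool


# Boolean practices to aggregate (assumed names, mirroring the fields above).
PRACTICES = (
    "peer_reviewed",
    "prompts_available",
    "has_repository",
    "used_whole_dataset",
    "compared_with_open_models",
    "reports_model_version",
)


def practice_rates(papers):
    """Return, for each practice, the fraction of papers that follow it."""
    if not papers:
        return {}
    total = len(papers)
    return {
        name: sum(getattr(paper, name) for paper in papers) / total
        for name in PRACTICES
    }


# Placeholder records, NOT data from the reviewed papers.
example_papers = [
    PaperRecord("Paper A", True, True, False, False, True, False),
    PaperRecord("Paper B", False, True, False, True, False, False),
]

example_rates = practice_rates(example_papers)
for practice, rate in example_rates.items():
    print(f"{practice}: {rate:.0%}")
```

Keeping one boolean per practice makes it straightforward to report, for instance, what share of papers state the model version, which is the kind of figure the summary refers to when it says only a minority did so.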
Created on 12 Feb. 2024

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.