The Effects of Data Quality on ML-Model Performance

AI-generated keywords: Data Quality, Machine Learning, AI Applications, Performance, Reliability

AI-generated Key Points

  • The paper presents a comprehensive experimental study on the correlation between six data quality dimensions and the performance of fifteen machine learning algorithms.
  • High-quality training and test data are crucial for reliable AI applications.
  • This paper is the first systematic study of the effects of data quality dimensions not only for classification but also for clustering and regression tasks.
  • The authors perform a targeted analysis for cases where serving data, training data or both are of low quality.
  • The results show a strong correlation between all six traditional data quality dimensions and ML-model performance across all three types of tasks (classification, regression, and clustering).
  • All polluters and datasets used in the experiments are available online, and the setup can easily be extended with further quality dimensions or ML models.

Authors: Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Hazar Harmouch, Felix Naumann

License: CC BY-SA 4.0

Abstract: Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example, incomplete, erroneous or inappropriate training data can lead to unreliable models that produce ultimately poor decisions. Trustworthy AI applications require high-quality training and test data along many dimensions, such as accuracy, completeness, consistency, and uniformity. We explore empirically the correlation between six of the traditional data quality dimensions and the performance of fifteen widely used ML algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining ML results in terms of data quality. Our experiments distinguish three scenarios based on the AI pipeline steps that were fed with polluted data: polluted training data, test data, or both. We conclude the paper with an extensive discussion of our observations and recommendations, alongside open questions and future directions to be explored.

Submitted to arXiv on 29 Jul. 2022

Results of the summarizing process for the arXiv paper: 2207.14529v1

This paper presents a comprehensive experimental study of the correlation between six data quality dimensions and the performance of fifteen machine learning (ML) algorithms. Modern AI applications need large quantities of training and test data, which creates critical challenges concerning not only the availability of such data but also its quality. Incomplete, erroneous, or inappropriate training data can lead to unreliable models that ultimately produce poor decisions. Trustworthy AI applications require high-quality training and test data along many dimensions, such as accuracy, completeness, consistency, and uniformity.

While previous research has studied the effects of label noise and missing values on classification tasks, this paper is the first systematic study of the effects of data quality dimensions on clustering and regression tasks as well as classification. The authors present a systematic empirical benchmark of the correlation between data quality and ML-model performance under the umbrella of data-centric AI. They perform a targeted analysis for cases where serving data, training data, or both are of low quality, distinguishing three scenarios by the AI pipeline step that is fed polluted data: training data, test data, or both. They simulate real-life scenarios concerning data in ML pipelines, provide practical insights and lessons learned for data scientists, raise several open questions, and point out possible directions for further research.

The experiments show a strong correlation between all six traditional data quality dimensions (accuracy, completeness, consistency, uniformity, timeliness, and believability) and ML-model performance across all three types of tasks (classification, regression, and clustering).

Overall, this work is a first step towards linking ML-model performance to the underlying data quality and understanding the connection between them. All polluters and datasets used in the experiments are available online, and the setup can easily be extended with further quality dimensions or ML models. The paper provides valuable insights into how to improve the reliability and trustworthiness of AI applications by ensuring high-quality training and test data.
Created on 18 May. 2023
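
The three pollution scenarios described in the summary (polluted training data, polluted test data, or both) can be illustrated with a minimal Python sketch. This is not the authors' code and does not use their published polluters. The function names (`pollute_completeness`, `run_scenarios`), the synthetic two-class data, the nearest-centroid classifier, and the mean imputation are all assumptions chosen for illustration, and only the completeness dimension is simulated.

```python
import random

def pollute_completeness(rows, fraction, rng):
    """Hypothetical polluter: copy rows, replacing each value with None with probability `fraction`."""
    return [[None if rng.random() < fraction else v for v in row] for row in rows]

def column_means(rows):
    """Per-column mean over the values that are present (0.0 if a column is entirely missing)."""
    means = []
    for j in range(len(rows[0])):
        present = [row[j] for row in rows if row[j] is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return means

def impute_with_mean(rows, means):
    """Fill missing values with the given column means."""
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def fit_centroids(rows, labels):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == l for r, l in zip(rows, labels))
    return hits / len(labels)

def make_data(n, rng):
    """Synthetic stand-in data: two Gaussian classes centered at (0, 0) and (3, 3)."""
    rows, labels = [], []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 3.0
        rows.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        labels.append(label)
    return rows, labels

def run_scenarios(fraction=0.5, seed=0):
    """Train and evaluate once per scenario; returns {scenario name: accuracy}."""
    rng = random.Random(seed)
    x_train, y_train = make_data(200, rng)
    x_test, y_test = make_data(200, rng)
    scenarios = [
        ("clean", False, False),
        ("polluted_train", True, False),
        ("polluted_test", False, True),
        ("polluted_both", True, True),
    ]
    results = {}
    for name, dirty_train, dirty_test in scenarios:
        train = pollute_completeness(x_train, fraction, rng) if dirty_train else x_train
        test = pollute_completeness(x_test, fraction, rng) if dirty_test else x_test
        means = column_means(train)  # imputation statistics come from training data only
        model = fit_centroids(impute_with_mean(train, means), y_train)
        results[name] = accuracy(model, impute_with_mean(test, means), y_test)
    return results

if __name__ == "__main__":
    for scenario, acc in run_scenarios().items():
        print(f"{scenario}: {acc:.3f}")
```

Repeating `run_scenarios` over a range of `fraction` values would yield one quality-versus-accuracy curve per scenario, which is the kind of relationship the study examines at much larger scale, with real datasets and fifteen algorithms.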
