This paper presents a comprehensive experimental study of the correlation between six data quality dimensions and the performance of fifteen machine learning (ML) algorithms. The need for large quantities of training and test data in modern AI applications creates critical challenges concerning not only the availability of such data but also its quality. Incomplete, erroneous, or inappropriate training data can lead to unreliable models that ultimately produce poor decisions. Trustworthy AI applications require training and test data that are of high quality along many dimensions, such as accuracy, completeness, consistency, and uniformity. While previous research has studied the effects of label noise and missing values on classification tasks, this paper is the first systematic study of the effects of data quality dimensions on clustering and regression tasks as well as classification. Under the umbrella of data-centric AI, the authors run a systematic empirical benchmark that distinguishes three scenarios according to which step of the ML pipeline is fed polluted data: the training data, the test (serving) data, or both. These scenarios simulate real-life situations in ML pipelines and yield practical insights and lessons learned for data scientists. The authors also raise several open questions and point out possible directions for further research. Their results show a strong correlation between all six traditional data quality dimensions (accuracy, completeness, consistency, uniformity, timeliness, and believability) and ML model performance across all three types of tasks (classification, regression, and clustering).
Overall, this work is a first step towards linking ML model performance to the underlying data quality and understanding the connection between the two. All polluters and datasets used in the experiments are available online, and the benchmark can easily be extended with further quality dimensions or ML models. The paper thus offers valuable insights into how to improve the reliability and trustworthiness of AI applications by ensuring high-quality training and test data.
- The paper presents a comprehensive experimental study on the correlation between six data quality dimensions and the performance of fifteen machine learning algorithms.
- High-quality training and test data are crucial for reliable AI applications.
- This paper is the first systematic study of the effects of data quality dimensions on clustering and regression tasks as well as classification.
- The authors perform a targeted analysis for cases where serving data, training data, or both are of low quality.
- The results show a strong correlation between all six traditional dimensions of data quality and ML model performance across all three types of tasks (classification, regression, and clustering).
- All polluters and datasets used in the experiments are available online, and the benchmark can easily be extended with further quality dimensions or ML models.
This paper talks about how important it is to have good data for computers to learn and make good decisions. The authors did experiments to see how different types of bad data affect how well the computer learns. They looked at three different types of tasks: sorting things into groups, making predictions, and finding patterns. They found that all six types of bad data they tested made the computer do worse on these tasks. Other people can reuse the study's data-spoiling tools (called polluters) and its datasets to run the same tests on their own learning programs.
Definitions
- Experimental study: a scientific test where people try different things to see what happens
- Correlation: when two things are related or connected in some way
- Data quality dimensions: different ways that data can be good or bad (such as accuracy or completeness)
- Machine learning algorithms: computer programs that can learn from data and make predictions or decisions based on what they learned
- Classification: sorting things into groups that are known in advance, based on their characteristics
- Clustering: finding patterns in data by grouping similar things together, without being told the groups in advance
- Regression: predicting a number (such as a price or a temperature) from other known information
Data Quality and Machine Learning Performance: A Comprehensive Experimental Study
AI applications are becoming increasingly popular in many industries, from healthcare to finance. However, the need for large quantities of training and test data creates critical challenges concerning not only the availability of such data but also its quality. Incomplete, erroneous, or inappropriate training data can lead to unreliable models that ultimately produce poor decisions. Trustworthy AI applications require training and test data that are of high quality along many dimensions, such as accuracy, completeness, consistency, and uniformity.
In this article, we discuss a research paper that presents a comprehensive experimental study of the correlation between six data quality dimensions and the performance of fifteen machine learning (ML) algorithms. The authors perform a targeted analysis for cases where serving data, training data, or both are of low quality. They present a systematic empirical benchmark for understanding the correlation between data quality and ML model performance under the umbrella of data-centric AI. They also raise several questions and point out possible directions for further research.
Background
Previous research has studied the effects of label noise and missing values on classification tasks. This paper, however, is the first systematic study of the effects of all six traditional dimensions (accuracy, completeness, consistency, uniformity, timeliness, and believability) on clustering and regression tasks as well as classification, for cases where the training data, the serving data, or both are polluted with low-quality information.
Experimental Setup
The authors distinguish three scenarios based on which step of the AI pipeline is fed polluted data: the training dataset, the test dataset, or both. In each scenario, the data is contaminated at different levels along the six traditional dimensions (accuracy, completeness, consistency, uniformity, timeliness, and believability). All polluters and datasets used in the experiments are available online, and the benchmark can easily be extended with further quality dimensions or ML models.
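The three scenarios can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`pollute_completeness`, `build_scenarios`, `missing_fraction`), the choice of completeness as the polluted dimension, and the synthetic data are all assumptions made for illustration. The polluter simply blanks out feature cells with a given probability, and the helper returns the train/test pair for each scenario.

```python
import random


def pollute_completeness(rows, rate, seed=0):
    """Hypothetical completeness polluter (an assumption, not the paper's code).

    Each feature cell is replaced by None with probability `rate`.
    The input rows are left unchanged; a new list is returned.
    """
    rng = random.Random(seed)
    polluted = []
    for features, label in rows:
        new_features = [None if rng.random() < rate else v for v in features]
        polluted.append((new_features, label))
    return polluted


def build_scenarios(train, test, rate, seed=0):
    """Return the (train, test) pair for each of the three pollution scenarios."""
    polluted_train = pollute_completeness(train, rate, seed)
    polluted_test = pollute_completeness(test, rate, seed + 1)
    return {
        "polluted_train": (polluted_train, test),
        "polluted_test": (train, polluted_test),
        "polluted_both": (polluted_train, polluted_test),
    }


def missing_fraction(rows):
    """Share of feature cells that are missing (None)."""
    cells = [v for features, _ in rows for v in features]
    return sum(v is None for v in cells) / len(cells)


if __name__ == "__main__":
    # Synthetic placeholder data: two numeric features and a binary label.
    rng = random.Random(42)
    data = [([rng.gauss(0, 1), rng.gauss(0, 1)], rng.randint(0, 1))
            for _ in range(200)]
    train, test = data[:150], data[150:]
    for name, (tr, te) in build_scenarios(train, test, rate=0.3).items():
        print(name, round(missing_fraction(tr), 2), round(missing_fraction(te), 2))
```

In a fuller setup, each scenario's pair would be passed to the same model at several pollution rates, so that differences in performance can be attributed to where the pollution was injected.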
Results & Findings
The results show a strong correlation between all six traditional dimensions of data quality (accuracy, completeness, consistency, uniformity, timeliness, and believability) and ML model performance across all three types of tasks (classification, regression, and clustering). The work offers insights into how ensuring high-quality training and test datasets can improve reliability and trustworthiness, and it provides practical lessons for data scientists.
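To make concrete what a correlation between data quality and performance means, the following Python sketch runs a toy version of the polluted-test scenario. Everything in it is an assumption for illustration and none of it reproduces the paper's experiments: the data is synthetic, the model is a simple nearest-centroid classifier, and `pollute_values` is a hypothetical accuracy polluter that overwrites a share of the test values with random noise. The Pearson coefficient is then computed between the pollution rate and the resulting classification accuracy.

```python
import math
import random


def make_data(n, rng):
    """Synthetic 1-D, two-class data: class 0 centred at -2, class 1 at +2."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(4 * label - 2, 1.0), label))
    return rows


def fit_centroids(train):
    """Nearest-centroid classifier: store the mean feature value per class."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid matches the true label."""
    hits = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in rows
    )
    return hits / len(rows)


def pollute_values(rows, rate, rng):
    """Hypothetical accuracy polluter: overwrite a share of values with noise."""
    return [(rng.uniform(-5, 5) if rng.random() < rate else x, y)
            for x, y in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def run_experiment(seed=0):
    """Train on clean data, test at rising pollution rates, correlate the two."""
    rng = random.Random(seed)
    train, test = make_data(500, rng), make_data(500, rng)
    model = fit_centroids(train)
    rates = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    scores = [accuracy(model, pollute_values(test, r, rng)) for r in rates]
    return rates, scores, pearson(rates, scores)


if __name__ == "__main__":
    rates, scores, r = run_experiment()
    for rate, score in zip(rates, scores):
        print(f"pollution rate {rate:.1f}: accuracy {score:.3f}")
    print(f"Pearson r (pollution rate vs. accuracy): {r:.3f}")
```

Because the sketch correlates performance with the pollution rate rather than with quality, a negative coefficient here corresponds to a positive relationship between data quality and model performance.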
Conclusion
This paper provides useful insights into how improving the quality of the underlying datasets can lead to better-performing ML models, which matters all the more as reliance on AI technologies grows across industries. It also raises several questions and points out possible directions for further research, making it a useful resource for readers who want to explore this area in more depth.