DataCI: A Platform for Data-Centric AI on Streaming Data

AI-generated keywords: DataCI, open-source platform, data-centric AI, streaming data, pipeline development

AI-generated Key Points

  • DataCI is an open-source platform for data-centric AI in dynamic streaming data settings
  • It provides infrastructure and APIs for seamless streaming dataset management, pipeline development, and evaluation
  • A version control function tracks pipeline lineage, and a graphical interface improves the user experience
  • User experience investigation includes a playground with data selection, pipeline launching, and experiment details
  • Quantitative analysis simulates a real-world case using the Yelp dataset in streaming mode
  • New pipeline versions are continuously developed using the latest data, but version 8 fails to outperform version 7
  • Using older versions without frequent updates leads to significant drops in online performance
  • DataCI addresses shortcomings of existing tools by streamlining streaming data management and method deployment
  • Preliminary studies demonstrate its potential to revolutionize data-centric AI in dynamic contexts
  • Further exploration is needed to determine upgrade frequency and better metrics for measuring pipeline performance in streaming scenarios
  • DataCI offers an efficient platform for developing and evaluating data-centric AI models in streaming data settings

Authors: Huaizheng Zhang, Yizheng Huang, Yuanming Li

3 pages, 4 figures
License: CC BY 4.0

Abstract: We introduce DataCI, a comprehensive open-source platform designed specifically for data-centric AI in dynamic streaming data settings. DataCI provides 1) an infrastructure with rich APIs for seamless streaming dataset management and for data-centric pipeline development and evaluation in streaming scenarios, 2) a carefully designed version control function to track pipeline lineage, and 3) an intuitive graphical interface for a better interactive user experience. Preliminary studies and demonstrations attest to the ease of use and effectiveness of DataCI, highlighting its potential to revolutionize the practice of data-centric AI in streaming data contexts.

Submitted to arXiv on 27 Jun. 2023


Results of the summarizing process for the arXiv paper: 2306.15538v2

DataCI is an open-source platform for data-centric AI in dynamic streaming data settings. It provides an infrastructure with rich APIs for streaming dataset management and for developing and evaluating data-centric pipelines in streaming scenarios. The platform also offers a version control function that tracks pipeline lineage and a graphical interface for interactive use.

The effectiveness and usability of DataCI are demonstrated from two perspectives: a user experience investigation and a quantitative analysis.

For the user experience, DataCI provides a playground where users can try out the system interactively. The playground has three sections: selection of data from the Streaming Data Sink and of pre-defined pipelines from the Pipeline Registry; manual pipeline launching, with each pipeline visualized as a directed acyclic graph (DAG); and a view of experiment run details for reference.

For the quantitative analysis, a real-world case is simulated using the Yelp dataset in streaming mode. Starting from pipeline version 5 (v5), a new pipeline version 6 (v6) is developed and deployed after passing an A/B test. Subsequent versions are developed continuously using the latest data from the Streaming Data Sink. However, version 8 (v8) fails to outperform version 7 (v7). In addition, if v6 is kept in use without frequent updates, online performance drops significantly. Because data distributions change frequently, this preliminary study highlights the need for a system like DataCI that can quickly build and evaluate data-centric pipelines on streaming data.

In conclusion, DataCI addresses the shortcomings of existing tools in streaming data environments by streamlining streaming data management and method deployment through its modular features and intuitive interface. Preliminary studies suggest it has the potential to revolutionize data-centric AI in dynamic contexts. Further exploration is needed to determine how often pipelines should be upgraded and to identify better metrics for measuring pipeline performance in streaming scenarios. Overall, DataCI offers researchers and practitioners an efficient platform for developing and evaluating data-centric AI models in streaming data settings.
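
The A/B-gated upgrade described in the quantitative analysis can be illustrated with a minimal sketch. This is not DataCI's API: the names `Deployment`, `ab_test`, and `promote_if_better`, the `min_gain` threshold, and the fixed scores below are all hypothetical, chosen only to show how a candidate version is promoted when it beats the deployed one on recent streaming batches and rejected otherwise.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical types, not DataCI's API: a "pipeline" is any callable that
# returns a quality score for one batch of streaming data (higher is better).
Batch = Sequence[float]
Pipeline = Callable[[Batch], float]


@dataclass(frozen=True)
class Deployment:
    version: str
    pipeline: Pipeline


def ab_test(current: Pipeline, candidate: Pipeline,
            batches: List[Batch], min_gain: float = 0.0) -> bool:
    """Return True if the candidate's mean score exceeds the current
    pipeline's mean score by more than min_gain on the same batches."""
    if not batches:
        raise ValueError("need at least one batch to compare pipelines")
    current_score = mean(current(b) for b in batches)
    candidate_score = mean(candidate(b) for b in batches)
    return candidate_score - current_score > min_gain


def promote_if_better(deployed: Deployment, candidate: Deployment,
                      batches: List[Batch], min_gain: float = 0.0) -> Deployment:
    """Keep the deployed version unless the candidate passes the A/B test."""
    if ab_test(deployed.pipeline, candidate.pipeline, batches, min_gain):
        return candidate
    return deployed


# Toy usage with made-up constant scores (not results from the paper).
v6 = Deployment("v6", lambda batch: 0.70)
v7 = Deployment("v7", lambda batch: 0.80)
v8 = Deployment("v8", lambda batch: 0.75)
recent_batches = [[1.0, 2.0], [3.0, 4.0]]

deployed = promote_if_better(v6, v7, recent_batches)        # v7 passes the gate
deployed = promote_if_better(deployed, v8, recent_batches)  # v8 is rejected
print(deployed.version)
```

In this toy run the second candidate scores lower than the deployed version, so the gate keeps v7 in place, which mirrors the situation where v8 fails to outperform v7. A real setup would replace the constant scores with an online metric computed on the latest data.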
Created on 26 Jan. 2024

