DataCI is an open-source platform that aims to revolutionize data-centric AI in dynamic streaming data settings. It provides a comprehensive infrastructure with rich APIs for seamless streaming dataset management, data-centric pipeline development, and evaluation on streaming scenarios. The platform also offers a carefully designed versioning control function to track the lineage of pipelines and an intuitive graphical interface for an enhanced user experience.
To demonstrate the effectiveness and usability of DataCI, two perspectives are considered: user experience investigation and quantitative analysis. In terms of user experience, DataCI provides a playground where users can interactively try out the system. The playground consists of three sections: selection of data from the Streaming Data Sink and of pre-defined pipelines from the Pipeline Registry, manual pipeline launching with visualization through directed acyclic graphs (DAGs), and presentation of experiment running details for reference.
For quantitative analysis, a real-world case is simulated using the Yelp dataset in a streaming mode. Starting from pipeline version 5 (v5), a new pipeline version 6 (v6) is developed and deployed after passing an A/B test. Subsequent versions are continuously developed using the latest data from the Streaming Data Sink. However, version 8 (v8) fails to outperform version 7 (v7). Additionally, it is observed that if v6 is used without frequent updates, online performance drops significantly. This preliminary study highlights the necessity of a system like DataCI for quickly building and evaluating data-centric pipelines on streaming data, because data distributions change frequently.
In conclusion, DataCI addresses the shortcomings of existing tools in streaming data environments by streamlining streaming data management and method deployment through its modular features and intuitive interface. Preliminary studies demonstrate its potential to revolutionize data-centric AI in dynamic contexts. Further exploration is needed to determine upgrade frequency and identify better metrics for measuring pipeline performance in streaming scenarios. Overall, DataCI offers researchers and practitioners an efficient platform for developing and evaluating data-centric AI models in streaming data settings, ultimately advancing the field of data-centric AI.
- DataCI is an open-source platform for data-centric AI in dynamic streaming data settings
- It provides infrastructure and APIs for seamless streaming dataset management, pipeline development, and evaluation
- Versioning control function tracks pipeline lineage and graphical interface enhances user experience
- User experience investigation includes a playground with data selection, pipeline launching, and experiment details
- Quantitative analysis simulates a real-world case using Yelp dataset in streaming mode
- New pipeline versions are continuously developed using the latest data, but version 8 fails to outperform version 7
- Using older versions without frequent updates leads to significant drops in online performance
- DataCI addresses shortcomings of existing tools by streamlining streaming data management and method deployment
- Preliminary studies demonstrate its potential to revolutionize data-centric AI in dynamic contexts
- Further exploration is needed to determine upgrade frequency and better metrics for measuring pipeline performance in streaming scenarios
- DataCI offers an efficient platform for developing and evaluating data-centric AI models in streaming data settings
DataCI is a special computer program that helps with using data to make smart decisions. It can handle data that keeps changing all the time. It has tools and ways to manage and organize datasets, create plans for using the data, and check how well the plans work. People who use DataCI can easily see how everything is connected and try different ideas with the data. They can also study real-life examples using a big collection of information from Yelp. Sometimes, new versions of plans are made, but they don't always work better than older ones. If people don't update their plans often, they might not get good results online. DataCI makes it easier to work with streaming data and use it in smart ways.
Definitions:
- Data-centric: Focusing on or centered around data.
- AI: Artificial Intelligence - Computer systems that can perform tasks that normally require human intelligence.
- Open-source: Software that is freely available for anyone to use, modify, and distribute.
- Streaming: A continuous flow of data that arrives over time rather than as a fixed, complete set.
- Infrastructure: The basic physical or organizational structures needed for an operation or system to function.
- APIs: Application Programming Interfaces - Sets of rules and protocols that allow different software applications to communicate with each other.
- Versioning control: The ability to keep track of different versions or changes made to a software program or project over time.
- Lineage: The history or origin of something, in this case referring to the history of a pipeline across its versions.
Data-centric artificial intelligence (AI) has become increasingly important in today's fast-paced world, where data is constantly streaming in from various sources. However, developing and deploying AI models on streaming data can be challenging due to the dynamic nature of the data. To address this issue, a team of researchers has developed an open-source platform called DataCI that aims to revolutionize data-centric AI in dynamic streaming data settings.
DataCI provides a comprehensive infrastructure with rich APIs for seamless streaming dataset management, data-centric pipeline development, and evaluation on streaming scenarios. This means that users can manage their datasets, develop pipelines specific to their needs, and evaluate them on streaming data. The platform also offers a carefully designed versioning control function to track the lineage of pipelines and an intuitive graphical interface for an enhanced user experience.
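To make the idea of tracking pipeline lineage concrete, here is a minimal sketch of a version registry that remembers which version each new version was derived from. It is not DataCI's API: `PipelineVersion`, `PipelineRegistry`, `register`, `lineage`, and the pipeline name `sentiment_pipeline` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PipelineVersion:
    """One registered pipeline version (hypothetical record, not DataCI's schema)."""
    name: str
    version: int
    parent: Optional[int] = None  # version this one was derived from, if any


@dataclass
class PipelineRegistry:
    """Toy in-memory registry that records the parent of every version."""
    versions: Dict[int, PipelineVersion] = field(default_factory=dict)

    def register(self, name: str, parent: Optional[int] = None) -> PipelineVersion:
        # Reject parents that were never registered.
        if parent is not None and parent not in self.versions:
            raise KeyError(f"unknown parent version: {parent}")
        # Version numbers increase by one on every registration.
        number = max(self.versions, default=0) + 1
        record = PipelineVersion(name=name, version=number, parent=parent)
        self.versions[number] = record
        return record

    def lineage(self, version: int) -> List[int]:
        """Return the chain of versions from `version` back to its root."""
        chain: List[int] = []
        current: Optional[int] = version
        while current is not None:
            chain.append(current)
            current = self.versions[current].parent
        return chain


registry = PipelineRegistry()
v1 = registry.register("sentiment_pipeline")
v2 = registry.register("sentiment_pipeline", parent=v1.version)
v3 = registry.register("sentiment_pipeline", parent=v2.version)
print(registry.lineage(v3.version))  # [3, 2, 1]
```

Storing only a parent pointer per version keeps the record small while still letting the full history be reconstructed by walking the chain.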
To demonstrate the effectiveness and usability of DataCI, two perspectives are considered: user experience investigation and quantitative analysis. In terms of user experience, DataCI provides a playground where users can interactively try out the system. The playground consists of three sections:
1) Data and pipeline selection - Users select data from the Streaming Data Sink and pre-defined pipelines from the Pipeline Registry.
2) Manual pipeline launching - Users launch a pipeline manually and see it visualized as a directed acyclic graph (DAG).
3) Experiment running details - The playground presents the details of each experiment run for reference, so that users can follow what happened throughout the process.
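The DAG view of a pipeline mentioned above can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The stage names below are hypothetical and are not taken from DataCI; the point is only that a pipeline expressed as a DAG has a well-defined execution order in which every stage runs after the stages it depends on.

```python
from graphlib import TopologicalSorter

# Each key is a stage; its value is the set of stages it depends on.
# All stage names are invented for illustration.
pipeline_dag = {
    "load_stream_batch": set(),
    "clean_text": {"load_stream_batch"},
    "augment_text": {"clean_text"},
    "select_samples": {"clean_text"},
    "train_model": {"augment_text", "select_samples"},
    "evaluate": {"train_model"},
}


def execution_order(dag):
    """Return one valid order in which the stages can run.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline_dag)
print(order)
```

Stages with no dependency between them, such as `augment_text` and `select_samples` here, may appear in either order, which is also what allows them to run in parallel.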
For quantitative analysis, a real-world case is simulated using the Yelp dataset in a streaming mode. Starting from pipeline version 5 (v5), a new pipeline version 6 (v6) is developed and deployed after passing an A/B test. Subsequent versions are continuously developed using the latest data from the Streaming Data Sink. However, version 8 (v8) fails to outperform version 7 (v7). This shows that a newer pipeline version is not automatically a better one, so each candidate needs to be evaluated against the version currently in use before it is deployed.
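The promotion step described above can be sketched as a simple gate: score the current and the candidate pipeline on the same recent batch and promote the candidate only if it scores higher. This is a simplified stand-in for an A/B test, not DataCI's implementation; a real A/B test would typically split live traffic and check statistical significance. The function names, the two toy models, and the four-example batch are all invented for illustration and are not results from the study.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]      # (review text, label)
Model = Callable[[str], int]   # maps text to a predicted label


def accuracy(model: Model, batch: Sequence[Example]) -> float:
    """Fraction of examples in the batch that the model labels correctly."""
    if not batch:
        raise ValueError("batch must not be empty")
    correct = sum(1 for text, label in batch if model(text) == label)
    return correct / len(batch)


def should_promote(current: Model, candidate: Model,
                   batch: Sequence[Example], min_gain: float = 0.0) -> bool:
    """Promote the candidate only if it beats the current model by more than min_gain."""
    return accuracy(candidate, batch) - accuracy(current, batch) > min_gain


# Toy stand-ins for the models produced by two pipeline versions (hypothetical rules).
def current_model(text: str) -> int:
    return 1 if "good" in text else 0


def candidate_model(text: str) -> int:
    return 1 if "great" in text else 0


# Synthetic batch, invented for illustration only.
latest_batch: List[Example] = [
    ("good food", 1),
    ("great service", 1),
    ("good but slow", 1),
    ("bad experience", 0),
]

print(should_promote(current_model, candidate_model, latest_batch))  # False
```

In this toy batch the candidate scores lower than the current model, so it is not promoted, which mirrors the situation where a newer version fails to outperform its predecessor.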
Furthermore, it is observed that if v6 is kept in use without frequent updates, online performance drops significantly. This preliminary study emphasizes the need for a platform like DataCI that supports quickly building and evaluating data-centric pipelines on streaming data, because data distributions change frequently.
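One way to notice this kind of degradation is to track accuracy over consecutive windows of the stream and flag the first window that falls too far below the initial one. The sketch below assumes that a per-prediction correctness flag is available; the function names, window size, threshold, and the sequence of flags are all invented for illustration and are not measurements from the study.

```python
from typing import List, Sequence


def windowed_accuracy(hits: Sequence[bool], window: int) -> List[float]:
    """Accuracy over consecutive, non-overlapping windows; a trailing partial window is ignored."""
    if window <= 0:
        raise ValueError("window must be positive")
    scores = []
    for start in range(0, len(hits) - window + 1, window):
        chunk = hits[start:start + window]
        scores.append(sum(chunk) / window)
    return scores


def first_degraded_window(scores: Sequence[float], max_drop: float) -> int:
    """Index of the first window more than max_drop below window 0, or -1 if none."""
    if not scores:
        return -1
    baseline = scores[0]
    for index, score in enumerate(scores):
        if baseline - score > max_drop:
            return index
    return -1


# Synthetic correctness flags (True = prediction was right), invented for illustration.
hits = [True, True, True, True,
        True, True, True, False,
        True, False, True, False,
        False, False, True, False]

scores = windowed_accuracy(hits, window=4)
print(scores)                              # [1.0, 0.75, 0.5, 0.25]
print(first_degraded_window(scores, 0.3))  # 2
```

A check like this only signals that an update may be due; how often to upgrade and which metric to watch remain open questions.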
In conclusion, DataCI addresses the shortcomings of existing tools in streaming data environments by streamlining streaming data management and method deployment through its modular features and intuitive interface. Preliminary studies demonstrate its potential to revolutionize data-centric AI in dynamic contexts. Further exploration is needed to determine upgrade frequency and identify better metrics for measuring pipeline performance in streaming scenarios.
Overall, DataCI offers researchers and practitioners an efficient platform for developing and evaluating data-centric AI models in streaming data settings, ultimately advancing the field of data-centric AI. With its user-friendly interface and powerful features, DataCI has the potential to make a significant impact on how we approach AI development on streaming datasets. As more research is conducted using this platform, we can expect further advancements in the field of data-centric AI.