Data Governance in the Age of Large-Scale Data-Driven Language Technology

AI-generated keywords: Data Governance, Machine Learning, Language Data, Wikimedia Project, DSO Model

AI-generated Key Points

  • Machine Learning technology, particularly Large Language Models, has brought attention to the need for systematic and transparent management of language data.
  • The authors propose a global language data governance approach that organizes data management among stakeholders, values, and rights.
  • The framework presented is a multi-party international governance structure focused on language data, incorporating technical and organizational tools.
  • There is tension between the goals of reproducible research and the need to update datasets for personal information removal.
  • The example of distributed data governance in the Wikimedia Project offers valuable insights into collaborative and self-regulated data curation.
  • Core stakeholders in Wikimedia projects align with those in the proposed governance structure.
  • Challenges faced by Wikimedia projects parallel those that would arise in global digital language data governance.
  • Content licenses play a role in governing Wikimedia data but may conflict with cultural values or concerns about exploitation.
  • Editors enforce policies that evolve through contestation to ensure adherence to chosen licenses and regulations.
  • A new model called Data Stewardship Organization (DSO) is proposed to empower data subjects and involve them in decisions regarding their data use.
  • The DSO facilitates collaboration among multiple stakeholders to build and manage language resources responsibly.

Authors: Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Gérard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Isaac Johnson, Dragomir Radev, Somaieh Nikpoor, Jörg Frohberg, Aaron Gokaslan, Peter Henderson, Rishi Bommasani, Margaret Mitchell

Proceedings of FAccT 2022. ACM, New York, NY, USA
32 pages: Full paper and Appendices
License: CC BY 4.0

Abstract: The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.

Submitted to arXiv on 04 May. 2022

Results of the summarizing process for the arXiv paper: 2206.03216v1

The recent emergence and adoption of Machine Learning technology, particularly Large Language Models, has highlighted the importance of systematic and transparent management of language data. In response, the authors propose a global language data governance approach that aims to organize data management among stakeholders, values, and rights. The proposal is informed by prior work on distributed governance and is supported by an international research collaboration involving researchers and practitioners from 60 countries. The framework is a multi-party international governance structure focused specifically on language data, and it incorporates the technical and organizational tools necessary to support its work.

The authors emphasize the tension between the goals of reproducible research, which require datasets to be publicly recorded, and the need to update datasets to accommodate requests for personal information removal. They also note that unredacted copies of datasets may continue to circulate for extended periods.

To provide further context, the authors discuss the example of distributed data governance in the Wikimedia Project. This project offers valuable insights into highly collaborative and self-regulated data curation, similar to the goals of the proposed governance structure. The core stakeholders in Wikimedia projects align with those in Figure 1 of the proposed governance structure: contributors (data rights holders), editors (data custodians), Wikimedia Foundation (data stewards and helpers), researchers, digital platforms, and end-users (data modelers). The challenges faced by Wikimedia projects, such as diverse editor needs and goals, navigating local laws, and addressing power imbalances, parallel those that would arise in global digital language data governance. The authors highlight how content licenses play a role in governing Wikimedia data but may conflict with cultural values or concerns about exploitation. Editors enforce policies that constantly evolve through contestation to ensure adherence to the chosen licenses and regulations. As in the proposed framework, success within the Wikimedia editor community relies on a wide range of tools that facilitate data governance at scale.

In light of the needs identified in the preceding sections, the authors propose a new model called the Data Stewardship Organization (DSO). This organization aims to empower data subjects and rights holders by involving them in decisions regarding the use of their data. The DSO facilitates collaboration among multiple stakeholders to build and manage language resources responsibly, supporting the development of data-driven language technology. The authors acknowledge that coordination across stakeholders remains a significant challenge.
Created on 30 Jun. 2023
