Data Governance in the Age of Large-Scale Data-Driven Language Technology

AI-generated keywords: Data Governance, Machine Learning, Language Data, Wikimedia Project, DSO Model

AI-generated Key Points

Machine Learning technology, particularly Large Language Models, has brought attention to the need for systematic and transparent management of language data.
The authors propose a global language data governance approach that organizes data management among stakeholders, values, and rights.
The framework presented is a multi-party international governance structure focused on language data, incorporating technical and organizational tools.
There is tension between the goals of reproducible research and the need to update datasets for personal information removal.
The example of distributed data governance in the Wikimedia Project offers valuable insights into collaborative and self-regulated data curation.
Core stakeholders in Wikimedia projects align with those in the proposed governance structure.
Challenges faced by Wikimedia projects parallel those that would arise in global digital language data governance.
Content licenses play a role in governing Wikimedia data but may conflict with cultural values or concerns about exploitation.
Editors enforce policies that evolve through contestation to ensure adherence to chosen licenses and regulations.
A new model called Data Stewardship Organization (DSO) is proposed to empower data subjects and involve them in decisions regarding their data use.
The DSO facilitates collaboration among multiple stakeholders to build and manage language resources responsibly.

Authors: Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Gérard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Isaac Johnson, Dragomir Radev, Somaieh Nikpoor, Jörg Frohberg, Aaron Gokaslan, Peter Henderson, Rishi Bommasani, Margaret Mitchell

Proceedings of FAccT 2022. ACM, New York, NY, USA

arXiv: 2206.03216v1 (cs.CY)

32 pages: Full paper and Appendices

License: CC BY 4.0

Abstract: The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.

Submitted to arXiv on 04 May. 2022

Results of the summarizing process for the arXiv paper: 2206.03216v1

Comprehensive Summary

The recent emergence and adoption of Machine Learning technology, particularly Large Language Models, has highlighted the importance of systematic and transparent management of language data. In response, the authors propose a global language data governance approach that aims to organize data management among stakeholders, values, and rights. The proposal is informed by prior work on distributed governance and is supported by an international research collaboration involving researchers and practitioners from 60 countries. The framework is a multi-party international governance structure focused specifically on language data, and it incorporates both the technical and the organizational tools needed to support its work.

The authors emphasize the tension between the goals of reproducible research, which require public recording of datasets, and the need to update datasets to accommodate requests for personal information removal. They also note that unredacted copies of datasets may remain in circulation for extended periods.

To provide further context, the authors discuss the example of distributed data governance in the Wikimedia Project, which offers valuable insights into highly collaborative and self-regulated data curation of the kind their governance structure aims for. The core stakeholders in Wikimedia projects align with those in Figure 1 of the proposed governance structure: contributors (data rights holders), editors (data custodians), the Wikimedia Foundation (data stewards and helpers), and researchers, digital platforms, and end-users (data modelers). The challenges faced by Wikimedia projects, such as serving diverse editor needs and goals, navigating local laws, and addressing power imbalances, parallel those that would arise in global digital language data governance. Content licenses play a role in governing Wikimedia data but may conflict with cultural values or concerns about exploitation. Editors enforce policies that constantly evolve through contestation to ensure adherence to the chosen licenses and regulations. As in the proposed framework, success within the Wikimedia editor community relies on a wide range of tools that facilitate data governance at scale.

In light of the needs identified in the preceding sections, the authors propose a new model called the Data Stewardship Organization (DSO). This organization aims to empower data subjects and rights holders by involving them in decisions regarding the use of their data. The DSO facilitates collaboration among multiple stakeholders to build and manage language resources responsibly, supporting the development of data-driven language technology. The authors acknowledge that coordination across stakeholders remains a significant challenge.

Layman's Summary

Machine Learning technology, and Large Language Models in particular, lets computers learn from large amounts of language data. That makes it important to manage language data in a systematic and transparent way. The authors suggest a plan for how different groups of people can work together to manage it. One difficulty is balancing reproducible research, which needs datasets to be publicly recorded, against protecting people's personal information, which can require removing data. The Wikimedia Project is an example of how people can work together to manage data.

The Need for Global Language Data Governance

In recent years, the emergence and adoption of Machine Learning technology has highlighted the importance of systematic and transparent management of language data. To address this need, a research collaboration involving researchers and practitioners from 60 countries has proposed a global language data governance approach that aims to organize data management among stakeholders, values, and rights. The framework is intended to provide an international governance structure focused specifically on language data, incorporating both the technical and the organizational tools needed to support its work.

Challenges in Data Governance

The authors emphasize the tension between the goals of reproducible research, which require public recording of datasets, and the need to update datasets to accommodate requests for personal information removal. They also note that unredacted copies of datasets may remain in circulation for extended periods. To provide further context, they discuss how distributed data governance works in Wikimedia projects such as Wikipedia. The core stakeholders in these projects align with those outlined in Figure 1 of the proposed governance structure: contributors (data rights holders), editors (data custodians), the Wikimedia Foundation (data stewards and helpers), and researchers, digital platforms, and end-users (data modelers). The challenges faced by Wikimedia projects, such as serving diverse editor needs, navigating local laws, and addressing power imbalances, parallel those that would arise in global digital language data governance. Content licenses play a role in governing Wikimedia data but may conflict with cultural values or concerns about exploitation; editors must enforce policies that constantly evolve through contestation to ensure adherence to the chosen licenses and regulations.
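The tension between reproducibility and removal requests described above can be made concrete with a small sketch. This is an illustration only, not a mechanism taken from the paper: the `VersionedDataset` class, its method names, and the idea of publishing per-release hash manifests are all assumptions introduced here. Each release publicly records only fingerprints of the records, so a later release can drop a record while researchers can still see which release they used and what changed.

```python
import hashlib
from dataclasses import dataclass, field


def record_hash(text: str) -> str:
    """Return a stable SHA-256 fingerprint of one record's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class VersionedDataset:
    """Hypothetical dataset that publishes a hash manifest for each release."""

    records: dict[str, str] = field(default_factory=dict)  # record id -> text
    manifests: list[dict[str, str]] = field(default_factory=list)  # one per release

    def release(self) -> int:
        """Freeze the current records into a public manifest of hashes only."""
        manifest = {rid: record_hash(text) for rid, text in self.records.items()}
        self.manifests.append(manifest)
        return len(self.manifests) - 1  # index of the new release

    def remove(self, record_id: str) -> int:
        """Honor a removal request, then publish a new release without the record."""
        self.records.pop(record_id, None)
        return self.release()

    def removed_between(self, old: int, new: int) -> set[str]:
        """Return record ids present in release `old` but absent from release `new`."""
        return set(self.manifests[old]) - set(self.manifests[new])


ds = VersionedDataset(records={"a": "public text", "b": "text with personal info"})
v0 = ds.release()
v1 = ds.remove("b")
print(ds.removed_between(v0, v1))  # ids dropped after release v0
```

The sketch also shows the limit the summary points to: the manifest for release `v0` still exists, and anyone who downloaded that release keeps an unredacted copy, so versioning alone does not resolve the tension.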

Data Stewardship Organization Proposal

In light of the needs identified in the preceding sections, the authors propose a new model called the Data Stewardship Organization (DSO). This organization aims to empower data subjects and rights holders by involving them in decisions regarding the use of their data. The DSO facilitates collaboration among multiple stakeholders to build and manage language resources responsibly, supporting the development of data-driven language technology. The authors acknowledge that coordination across stakeholders remains a significant challenge, yet believe this model provides an effective way forward towards responsible management of language resources globally.

Conclusion

The recent emergence and adoption of Machine Learning technology has highlighted an urgent need for systematic management of large amounts of language data globally, something that traditional methods have been unable to meet adequately because of the complexity of managing different stakeholder interests at scale internationally. In response to this need, researchers from 60 countries have proposed a global language data governance approach incorporating both the technical and the organizational tools necessary to support its work. This proposal includes the establishment of a Data Stewardship Organization empowered to involve data subjects and rights holders in the decision-making process. While significant challenges remain in coordinating across multiple stakeholders, it is believed that this model will help facilitate responsible management of large amounts of global language resources moving forward.

Created on 30 Jun. 2023
