The recent emergence and adoption of Machine Learning technology, particularly Large Language Models, has highlighted the importance of systematic and transparent management of language data. In response to this need, the authors propose a global language data governance approach that aims to organize data management among stakeholders, values, and rights. This proposal is informed by prior work on distributed governance and is supported by an international research collaboration involving researchers and practitioners from 60 countries. The framework presented in this work is a multi-party international governance structure focused specifically on language data. It incorporates both the technical and the organizational tools necessary to support its work.
The authors emphasize the tension between the goals of reproducible research, which requires public recording of datasets, and the need to update datasets to accommodate requests for personal information removal. They also highlight that unredacted copies of datasets may continue to circulate for extended periods.
To provide further context, the authors discuss the example of distributed data governance in the Wikimedia Project. This project offers valuable insights into highly collaborative and self-regulated data curation, similar to the goals of their proposed governance structure. The core stakeholders in Wikimedia projects align with those in Figure 1 of the proposed governance structure: contributors (data rights holders), editors (data custodians), Wikimedia Foundation (data stewards and helpers), researchers, digital platforms, and end-users (data modelers). The challenges faced by Wikimedia projects, such as diverse editor needs and goals, navigating local laws, and addressing power imbalances, parallel those that would arise in global digital language data governance. The authors highlight how content licenses play a role in governing Wikimedia data but may conflict with cultural values or concerns about exploitation. Editors enforce policies that constantly evolve through contestation to ensure adherence to chosen licenses and regulations. As in the proposed framework, success within the Wikimedia editor community relies on a wide range of tools that facilitate data governance at scale.
In light of the needs for a comprehensive governance structure identified in the previous sections, the authors propose a new model called the Data Stewardship Organization (DSO). This organization aims to empower data subjects and rights holders by involving them in decisions regarding the use of their data. The DSO facilitates collaboration among multiple stakeholders to build and manage language resources responsibly, supporting the development of data-driven language technology. The authors acknowledge that coordination across stakeholders remains a significant challenge.
- Machine Learning technology, particularly Large Language Models, has brought attention to the need for systematic and transparent management of language data.
- The authors propose a global language data governance approach that organizes data management among stakeholders, values, and rights.
- The framework presented is a multi-party international governance structure focused on language data, incorporating technical and organizational tools.
- There is tension between the goals of reproducible research and the need to update datasets for personal information removal.
- The example of distributed data governance in the Wikimedia Project offers valuable insights into collaborative and self-regulated data curation.
- Core stakeholders in Wikimedia projects align with those in the proposed governance structure.
- Challenges faced by Wikimedia projects parallel those that would arise in global digital language data governance.
- Content licenses play a role in governing Wikimedia data but may conflict with cultural values or concerns about exploitation.
- Editors enforce policies that evolve through contestation to ensure adherence to chosen licenses and regulations.
- A new model called Data Stewardship Organization (DSO) is proposed to empower data subjects and involve them in decisions regarding their data use.
- The DSO facilitates collaboration among multiple stakeholders to build and manage language resources responsibly.
Machine Learning technology is a way for computers to learn from data, including the language people write and speak. That makes it important to manage language data in a systematic and transparent way. The authors suggest a plan for how different people can work together to manage language data. Sometimes it's hard to balance the need for research that others can check and repeat with protecting people's personal information. The Wikimedia Project is an example of how people can work together to manage data.
The Need for Global Language Data Governance
In recent years, the emergence and adoption of Machine Learning technology has highlighted the importance of systematic and transparent management of language data. To address this need, a research collaboration involving researchers and practitioners from 60 countries has proposed a global language data governance approach that aims to organize data management among stakeholders, values, and rights. This framework is intended to provide an international governance structure focused specifically on language data, one that incorporates both the technical and the organizational tools necessary to support its work.
Challenges in Data Governance
The authors emphasize the tension between the goals of reproducible research, which requires public recording of datasets, and the need to update datasets to accommodate requests for personal information removal. They also highlight that unredacted copies of datasets may continue to circulate for extended periods. To provide further context, they discuss how distributed data governance works in Wikimedia projects such as Wikipedia. The core stakeholders in these projects align with those outlined in Figure 1 of the proposed governance structure: contributors (data rights holders), editors (data custodians), Wikimedia Foundation (data stewards and helpers), researchers, digital platforms, and end-users (data modelers). The challenges faced by Wikimedia projects, such as diverse editor needs, navigating local laws, and addressing power imbalances, parallel those that would arise in global digital language data governance. Content licenses play a role in governing Wikimedia data but may conflict with cultural values or concerns about exploitation; editors must enforce policies that constantly evolve through contestation to ensure adherence to chosen licenses and regulations.
Data Stewardship Organization Proposal
In light of the needs for a comprehensive governance structure identified in the previous sections, the authors propose a new model called the Data Stewardship Organization (DSO). This organization aims to empower data subjects and rights holders by involving them in decisions regarding the use of their data. The DSO facilitates collaboration among multiple stakeholders to build and manage language resources responsibly, supporting the development of data-driven language technology. The authors acknowledge that coordination across stakeholders remains a significant challenge, yet they believe this model provides an effective way forward towards responsible management of language resources globally.
Conclusion
The recent emergence and adoption of Machine Learning technology has highlighted an urgent need for systematic management of large amounts of language data globally, something that traditional methods have been unable to meet adequately because of the complexity involved in managing different stakeholder interests at scale internationally. In response to this need, researchers from 60 countries have proposed a global language data governance approach incorporating both the technical and the organizational tools necessary to support its work. This proposal includes the establishment of a Data Stewardship Organization empowered to involve data subjects and rights holders in the decision-making process. While there remain significant challenges in coordinating across multiple stakeholders, it is believed that this model will help facilitate responsible management of large amounts of global language resources moving forward.