Large Language Models (LLMs) are increasingly being used to generate synthetic data for training and evaluating models, including question answering (QA) datasets. However, the effectiveness of LLMs in generating culturally relevant commonsense QA datasets for low-resource languages remains unclear. In this study, we investigate the use of LLMs in creating QA datasets for the Indonesian and Sundanese languages. We employ a multiple-choice question format inspired by the English CommonsenseQA data format to facilitate evaluation. Our experiments reveal that while the best-performing LLM, GPT-4 Turbo, can generate questions with adequate knowledge in Indonesian, its performance is lacking for Sundanese, underscoring the discrepancy between medium- and lower-resource languages. Additionally, we benchmark various LLMs on our generated datasets and find that they perform better on LLM-generated datasets than on human-created ones.
To address ethical concerns, all human-generated datasets have been manually validated to exclude harmful or offensive content, and harmful questions in LLM-generated datasets were filtered out automatically. The study has undergone review by the Institutional Review Board (IRB), and annotators were compensated above minimum wage. The datasets will be publicly available under a Creative Commons Non-Commercial license.
In analyzing the quality of LLM-generated data, we reviewed samples from both adapted and generated sets for accuracy of concepts, questions, and options. Concept adaptation from English to Indonesian showed high accuracy, but accuracy dropped for Indonesian-to-Sundanese translation due to weaker machine translation performance. Even where adaptations were accurate, concept variation was skewed towards specific entities within categories, indicating a need for manual development across diverse categories. Furthermore, question analysis revealed varying quality in generated questions when judged against strict criteria for correctness.
Improving the dataset's quality requires manual development of concepts across categories to enhance concept diversity and coverage. The study also extends its focus beyond Indonesian to include Sundanese, a local language of Indonesia. This expansion highlights the importance of addressing linguistic diversity within research efforts involving low-resource languages. Overall, this research serves as a valuable starting point for exploring culturally relevant commonsense QA datasets using LLMs in diverse linguistic contexts.
- Large Language Models (LLMs) are used to generate synthetic data for training and evaluating models, including question answering datasets.
- The effectiveness of LLMs in creating culturally relevant commonsense QA datasets for low-resource languages is unclear.
- The study investigates the use of LLMs in creating QA datasets for Indonesian and Sundanese, using a multiple-choice question format inspired by the English CommonsenseQA data format.
- The best-performing LLM, GPT-4 Turbo, can generate questions with adequate knowledge in Indonesian, but its performance is lacking for Sundanese.
- Various LLMs perform better on LLM-generated datasets than on human-created ones.
- Ethical considerations include manual validation of human-generated datasets and automatic filtering of harmful questions in LLM-generated datasets.
- The datasets will be publicly available under a Creative Commons Non-Commercial license.
- Quality analysis shows high accuracy for concept adaptation from English to Indonesian, but lower accuracy for Indonesian-to-Sundanese translation due to weaker machine translation performance.
- Concept variation is skewed towards specific entities within categories, indicating a need for manual development across diverse categories; question analysis reveals varying quality in generated questions under strict correctness criteria.
- The study extends its focus beyond Indonesian to include Sundanese, a local language of Indonesia, emphasizing the importance of addressing linguistic diversity within research efforts involving low-resource languages.
Summary
- Big computer programs are used to make up pretend information for teaching and testing other programs, like answering questions.
- We're not sure how good these big programs are at making up common-sense questions in languages that don't have many resources.
- A study is looking at how well these big programs can make question sets in Indonesian and Sundanese languages, inspired by English question sets.
- The best program, GPT-4 Turbo, is okay at making questions in Indonesian but not so good in Sundanese.
- Some of these big programs do better on the made-up data than on real human-made data.
Definitions
- Large Language Models (LLMs): Big computer programs that learn from lots of text and can write new text, including made-up information for training and testing other programs.
- Culturally relevant: Information that makes sense and fits well with a particular culture or group of people.
- Commonsense QA datasets: Collections of questions and answers based on everyday knowledge and understanding.
- Low-resource languages: Languages that don't have as many tools or support available compared to more widely spoken languages.
Large language models (LLMs) have become increasingly popular in recent years for their ability to generate synthetic data for training and evaluating models. This includes question answering (QA) datasets, which are essential for advancing natural language processing (NLP) research. However, the effectiveness of LLMs in generating culturally relevant commonsense QA datasets for low-resource languages remains unclear.
In response to this gap in knowledge, a team of researchers conducted a study investigating the use of LLMs in creating QA datasets for Indonesian and Sundanese languages. The study employed a multiple-choice question format inspired by the English CommonsenseQA data format to facilitate evaluation.
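To make the multiple-choice format concrete, the following Python sketch shows one way such an item and its accuracy scoring could be represented. It assumes the five-option, single-answer layout of English CommonsenseQA; the `MCQuestion` class, its field names, the `accuracy` helper, and the sample item are all hypothetical illustrations, not the study's actual schema or data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MCQuestion:
    """One multiple-choice item (hypothetical schema, CommonsenseQA-style)."""
    concept: str              # the concept the question is built around
    question: str             # question text
    choices: Dict[str, str]   # option label -> option text
    answer_key: str           # label of the correct option


def accuracy(items: List[MCQuestion], predictions: List[str]) -> float:
    """Fraction of items whose predicted label matches the answer key."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred == item.answer_key)
    return correct / len(items)


# Invented example item, for illustration only (not taken from the datasets).
example = MCQuestion(
    concept="market",
    question="Where would you most likely buy fresh vegetables?",
    choices={"A": "market", "B": "library", "C": "bank",
             "D": "cinema", "E": "post office"},
    answer_key="A",
)

# Two hypothetical model answers for the same item: one right, one wrong.
score = accuracy([example, example], ["A", "B"])
```

Keeping the answer as a label rather than as free text is what makes evaluation straightforward: a model's output only has to be compared against `answer_key`.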
The experiments revealed that while the best-performing LLM, GPT-4 Turbo, can generate questions with adequate knowledge in Indonesian, its performance is lacking for Sundanese. This highlights the discrepancy between medium- and lower-resource languages when it comes to utilizing LLMs for generating QA datasets.
To address ethical concerns in this research, all human-generated datasets were manually validated to exclude harmful or offensive content, and harmful questions in LLM-generated datasets were filtered out automatically. The study also underwent review by an Institutional Review Board (IRB), and annotators were compensated above minimum wage.
The resulting datasets will be publicly available under a Creative Commons Non-Commercial license. This allows other researchers to access and use them for non-commercial purposes.
In analyzing the quality of LLM-generated data, the researchers reviewed samples from both adapted and generated sets for accuracy of concepts, questions, and options. Concept adaptation from English to Indonesian showed high accuracy rates but dropped for Indonesian to Sundanese translations due to weaker machine translation performance.
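As a rough illustration of how such a review could be tallied, the sketch below computes an accuracy rate per criterion from manual judgments. The criterion names mirror the three aspects mentioned above (concepts, questions, options); the `accuracy_by_criterion` function, the data layout, and the sample judgments are assumptions made for this example and do not reproduce the study's annotations or results.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_criterion(reviews: List[Dict[str, bool]]) -> Dict[str, float]:
    """Share of reviewed samples judged correct, computed per criterion."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for review in reviews:
        for criterion, is_correct in review.items():
            totals[criterion] += 1
            if is_correct:
                correct[criterion] += 1
    return {criterion: correct[criterion] / totals[criterion]
            for criterion in totals}


# Made-up judgments for two reviewed samples (illustration only).
sample_reviews = [
    {"concept": True, "question": True, "options": False},
    {"concept": True, "question": False, "options": True},
]

rates = accuracy_by_criterion(sample_reviews)
```

Reporting the three rates separately, rather than as one pooled number, keeps a drop in one aspect (for example, translated concepts) from being hidden by the others.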
Despite accurate adaptations, concept variation was skewed towards specific entities within categories. This indicates a need for manual development across diverse categories to enhance concept diversity and coverage within these low-resource languages.
Furthermore, question analysis showed that the quality of the generated questions varied when judged against strict criteria for correctness. Together with the skew in concepts, this points to the need for manual work to improve the overall quality of the dataset.
One notable aspect of this research is that it extends beyond Indonesian, a medium-resource language, to include Sundanese, a lower-resource local language of Indonesia. This expansion highlights the importance of addressing linguistic diversity within research efforts involving low-resource languages. It also serves as a valuable starting point for exploring culturally relevant commonsense QA datasets using LLMs in diverse linguistic contexts.
In conclusion, this study sheds light on the potential and limitations of using LLMs to generate QA datasets for low-resource languages. It also emphasizes the importance of ethical considerations and linguistic diversity in NLP research. The resulting datasets will be a valuable resource for future studies in this area, and it is hoped that they will contribute to advancements in natural language understanding across various languages and cultures.