Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese

AI-generated keywords: Large Language Models, Synthetic Data, Question Answering, Low-Resource Languages, Culturally Relevant

AI-generated Key Points

  • Large Language Models (LLMs) are used to generate synthetic data for training and evaluating models, including question answering datasets.
  • The effectiveness of LLMs in creating culturally relevant commonsense QA datasets for low-resource languages is unclear.
  • The study investigates the use of LLMs to create QA datasets for Indonesian and Sundanese, using a multiple-choice question format inspired by the English CommonsenseQA data format.
  • The best-performing LLM, GPT-4 Turbo, can generate questions with adequate knowledge in Indonesian but not in Sundanese.
  • Various LLMs perform better on LLM-generated datasets than on human-created ones.
  • Ethical considerations include manual validation of human-generated datasets and automatic filtering of harmful questions in LLM-generated datasets.
  • Datasets will be publicly available under a Creative Commons Non-Commercial license.
  • Quality analysis shows high accuracy for concept adaptation from English to Indonesian, but accuracy drops for Indonesian-to-Sundanese translation due to weaker machine translation performance.
  • Question analysis reveals varying quality in generated questions under strict correctness criteria. Separately, concept variation is skewed towards specific entities within categories, indicating a need for manual concept development across diverse categories.
  • The study extends its focus beyond Indonesian to include Sundanese, a local language of Indonesia, emphasizing the importance of addressing linguistic diversity in research on low-resource languages.

Authors: Rifki Afina Putri, Faiz Ghifari Haznitrama, Dea Adhista, Alice Oh

License: CC BY 4.0

Abstract: Large Language Models (LLMs) are increasingly being used to generate synthetic data for training and evaluating models. However, it is unclear whether they can generate a good quality of question answering (QA) dataset that incorporates knowledge and cultural nuance embedded in a language, especially for low-resource languages. In this study, we investigate the effectiveness of using LLMs in generating culturally relevant commonsense QA datasets for Indonesian and Sundanese languages. To do so, we create datasets for these languages using various methods involving both LLMs and human annotators. Our experiments show that the current best-performing LLM, GPT-4 Turbo, is capable of generating questions with adequate knowledge in Indonesian but not in Sundanese, highlighting the performance discrepancy between medium- and lower-resource languages. We also benchmark various LLMs on our generated datasets and find that they perform better on the LLM-generated datasets compared to those created by humans.

Submitted to arXiv on 27 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.17302v1

Large Language Models (LLMs) are increasingly being used to generate synthetic data for training and evaluating models, including question answering (QA) datasets. However, the effectiveness of LLMs in generating culturally relevant commonsense QA datasets for low-resource languages remains unclear. In this study, we investigate the use of LLMs in creating QA datasets for Indonesian and Sundanese. We employ a multiple-choice question format inspired by the English CommonsenseQA data format to facilitate evaluation.

Our experiments reveal that while the best-performing LLM, GPT-4 Turbo, can generate questions with adequate knowledge in Indonesian, it cannot do so for Sundanese, underscoring the discrepancy between medium- and lower-resource languages. Additionally, we benchmark various LLMs on our generated datasets and find that they perform better on LLM-generated datasets than on human-created ones.

To address ethical considerations, all human-generated datasets have been manually validated to exclude harmful or offensive content, and harmful questions in LLM-generated datasets were filtered out automatically. The study has undergone review by the Institutional Review Board (IRB), and annotators were compensated above minimum wage. The datasets will be publicly available under a Creative Commons Non-Commercial license.

In analyzing the quality of LLM-generated data, we reviewed samples from both the adapted and the generated sets for accuracy of concepts, questions, and options. Concept adaptation from English to Indonesian showed high accuracy, but accuracy dropped for Indonesian-to-Sundanese translation due to weaker machine translation performance. Despite accurate adaptations, concept variation was skewed towards specific entities within categories. Improving the dataset's quality therefore requires manual development of concepts across categories to broaden concept diversity and coverage. Furthermore, question analysis revealed varying quality in generated questions under strict criteria for correctness.

The study also extends its focus beyond Indonesian to include Sundanese, a local language of Indonesia. This expansion highlights the importance of addressing linguistic diversity in research on low-resource languages. Overall, this research serves as a valuable starting point for exploring culturally relevant commonsense QA datasets using LLMs in diverse linguistic contexts.
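
The multiple-choice format and the benchmarking step described in the summary can be illustrated with a minimal Python sketch. It is not the authors' code. The record layout is only loosely modeled on a CommonsenseQA-style item (a concept, a question, labeled options, and an answer label), and every name here (`MultipleChoiceItem`, `format_prompt`, `accuracy`, `toy_items`, `always_b`) is hypothetical. The two items are invented English placeholders, not entries from the paper's datasets, and `always_b` stands in for a call to a real model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

LABELS = "ABCDE"  # assumption: at most five options per question


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical record for one multiple-choice commonsense question."""
    concept: str
    question: str
    options: Tuple[str, ...]
    answer: str  # label of the correct option, e.g. "B"

    def __post_init__(self) -> None:
        if not 2 <= len(self.options) <= len(LABELS):
            raise ValueError("expected between 2 and 5 options")
        if self.answer not in tuple(LABELS[: len(self.options)]):
            raise ValueError("answer label does not match any option")


def format_prompt(item: MultipleChoiceItem) -> str:
    """Render the question followed by its labeled options."""
    lines = [item.question]
    for label, text in zip(LABELS, item.options):
        lines.append(f"{label}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items: Sequence[MultipleChoiceItem],
             predict: Callable[[str], str]) -> float:
    """Fraction of items where the predicted label equals the answer label."""
    if not items:
        raise ValueError("cannot score an empty dataset")
    correct = sum(
        1 for item in items
        if predict(format_prompt(item)).strip().upper() == item.answer
    )
    return correct / len(items)


# Invented placeholder items, for illustration only.
toy_items = [
    MultipleChoiceItem(
        concept="scissors",
        question="What do people usually use to cut paper?",
        options=("spoon", "scissors", "pillow", "bucket", "ladder"),
        answer="B",
    ),
    MultipleChoiceItem(
        concept="freezer",
        question="Where do people usually keep ice cream?",
        options=("freezer", "oven", "bookshelf", "mailbox", "bathtub"),
        answer="A",
    ),
]


def always_b(prompt: str) -> str:
    """Dummy predictor standing in for an LLM; it always answers "B"."""
    return "B"


if __name__ == "__main__":
    print(format_prompt(toy_items[0]))
    print(accuracy(toy_items, always_b))
```

Under these assumptions, the comparison reported in the summary would amount to calling `accuracy` once on an LLM-generated set and once on a human-created set with the same predictor, then comparing the two scores.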
Created on 04 Mar. 2024
