Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT

AI-generated keywords: Large Language Models, Persian language, ChatGPT, benchmarking analysis, GPT-3.5-turbo

AI-generated Key Points

  • Large language models (LLMs) like ChatGPT and GPT-3.5-turbo have shown impressive performance in English but their effectiveness in low-resource languages like Persian is uncertain.
  • A comprehensive benchmarking analysis of LLMs, including GPT-3.5-turbo, GPT-4, and OpenChat-3.5, was conducted across various tasks in Persian.
  • The study introduced new benchmarks for reasoning tasks in Persian based on elementary school math questions and entrance exams for 7th and 10th grades due to the scarcity of Persian datasets.
  • LLMs, especially GPT-4, excel in tasks requiring reasoning and broad general knowledge, but often fall short compared to smaller pre-trained models fine-tuned for specific tasks in Persian.
  • Improved performance was observed when test sets were translated to English before being inputted into GPT-3.5, suggesting potential enhancements in LLM performance for the Persian language context.
  • The study's findings indicate significant room for improving LLM performance on Persian natural language processing tasks.

Authors: Amirhossein Abaskohi, Sara Baruni, Mostafa Masoudi, Nesa Abbasi, Mohammad Hadi Babalou, Ali Edalat, Sepehr Kamahi, Samin Mahdizadeh Sani, Nikoo Naghavian, Danial Namazifard, Pouya Sadeghi, Yadollah Yaghoobzadeh

14 pages, 1 figure, 6 tables, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)
License: CC BY 4.0

Abstract: This paper explores the efficacy of large language models (LLMs) for Persian. While ChatGPT and subsequent LLMs have shown remarkable performance in English, their effectiveness for lower-resource languages remains an open question. We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks. Our primary focus is on GPT-3.5-turbo, but we also include GPT-4 and OpenChat-3.5 to provide a more holistic evaluation. Our assessment encompasses a diverse set of tasks categorized into classic, reasoning, and knowledge-based domains. To enable a thorough comparison, we evaluate LLMs against existing task-specific fine-tuned models. Given the limited availability of Persian datasets for reasoning tasks, we introduce two new benchmarks: one based on elementary school math questions and another derived from the entrance exams for 7th and 10th grades. Our findings reveal that while LLMs, especially GPT-4, excel in tasks requiring reasoning abilities and a broad understanding of general knowledge, they often lag behind smaller pre-trained models fine-tuned specifically for particular tasks. Additionally, we observe improved performance when test sets are translated to English before inputting them into GPT-3.5. These results highlight the significant potential for enhancing LLM performance in the Persian language. This is particularly noteworthy due to the unique attributes of Persian, including its distinct alphabet and writing styles.

Submitted to arXiv on 03 Apr. 2024

Results of the summarizing process for the arXiv paper: 2404.02403v1

This paper examines the effectiveness of large language models (LLMs) for the Persian language, with a specific focus on ChatGPT and subsequent LLMs. While these models have shown impressive performance in English, their applicability to low-resource languages like Persian remains uncertain. To address this gap, the study conducts a comprehensive benchmarking analysis of LLMs across various tasks in Persian. The primary model under scrutiny is GPT-3.5-turbo, although the evaluation also covers GPT-4 and OpenChat-3.5 to provide a more holistic assessment. The tasks are categorized into classic, reasoning, and knowledge-based domains. To enable a thorough comparison, the LLMs are evaluated against existing task-specific fine-tuned models. Given the scarcity of Persian datasets for reasoning tasks, the study introduces two new benchmarks: one based on elementary school math questions and another derived from entrance exams for 7th and 10th grades. The findings indicate that while LLMs, especially GPT-4, excel in tasks requiring reasoning abilities and a broad understanding of general knowledge, they often fall short compared to smaller pre-trained models fine-tuned for specific tasks. The study also observes improved performance when test sets are translated to English before being passed to GPT-3.5, which suggests that LLM performance on Persian can still be improved substantially. This is particularly noteworthy given the unique attributes of Persian, including its distinct alphabet and writing styles. Overall, the results point to significant potential for enhancing LLM performance in Persian, and the study contributes valuable insights towards understanding how LLMs perform on Persian-language tasks.
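
The comparison between prompting in Persian directly and translating the test set to English first can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `accuracy`, `translate_then_answer`, `toy_model`, and `toy_translate` are hypothetical names, the two toy questions are invented for the example, and exact-match scoring is an assumption. A real setup would replace the toy functions with calls to an actual model and translation system.

```python
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def accuracy(model: Callable[[str], str], examples: Iterable[Example]) -> float:
    """Exact-match accuracy of `model` over (question, gold answer) pairs."""
    examples = list(examples)
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(model(q).strip() == gold.strip() for q, gold in examples)
    return correct / len(examples)


def translate_then_answer(
    model: Callable[[str], str], translate: Callable[[str], str]
) -> Callable[[str], str]:
    """Wrap `model` so each question is translated before it is answered."""
    def wrapped(question: str) -> str:
        return model(translate(question))
    return wrapped


# --- Hypothetical stand-ins, used only to make the sketch runnable ---
TOY_TRANSLATIONS = {
    "۲ + ۲ چند می‌شود؟": "What is 2 + 2?",
    "۳ × ۳ چند می‌شود؟": "What is 3 * 3?",
}
TOY_ANSWERS = {"What is 2 + 2?": "4", "What is 3 * 3?": "9"}


def toy_translate(question: str) -> str:
    # Lookup-table "translator"; unknown inputs pass through unchanged.
    return TOY_TRANSLATIONS.get(question, question)


def toy_model(question: str) -> str:
    # Toy "model" that only recognizes the English phrasing.
    return TOY_ANSWERS.get(question, "unknown")


EXAMPLES = [("۲ + ۲ چند می‌شود؟", "4"), ("۳ × ۳ چند می‌شود؟", "9")]

direct_score = accuracy(toy_model, EXAMPLES)
translated_score = accuracy(translate_then_answer(toy_model, toy_translate), EXAMPLES)
print(f"direct: {direct_score:.2f}, translate-then-answer: {translated_score:.2f}")
```

The toy scores say nothing about the paper's actual numbers; the point is only that both conditions are scored with the same metric on the same items, so any difference comes from the translation step.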
Created on 02 Feb. 2025
