Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models

AI-generated keywords: Large Language Models, Safety Concerns, Chinese SafetyQA Benchmark, LLM Factuality, Compliance and Legality Evaluations

AI-generated Key Points

Concerns surrounding safety in Large Language Models (LLMs) are increasingly prominent, especially in domains like law, policy, and ethics.
The Chinese SafetyQA benchmark was introduced to assess LLMs' factuality abilities in answering short questions related to safety.
Analysis of the Chinese Safety Domain dataset showed that o1-preview performed the best overall, while the gpt-4o-mini model had the lowest performance.
GPT models excelled in Physical & Mental Health topics but struggled in Illegal & Regulatory Compliance topics compared to specialized Chinese models like Qwen-series and Doubao.
Chinese models performed poorly on Safety Theoretical Knowledge topics, indicating a lack of understanding in areas such as network safety and information safety.
Related works have focused on evaluating LLM factuality through benchmarks like SimpleQA and Chinese SimpleQA, as well as methods like self-reflection and RAG.
The introduction of Chinese SafetyQA is a significant advancement for assessing LLM factuality specifically within the Chinese context, filling a crucial gap in existing safety benchmarks.

Authors: Yingshui Tan, Boren Zheng, Baihui Zheng, Kerui Cao, Huiyun Jing, Jincheng Wei, Jiaheng Liu, Yancheng He, Wenbo Su, Xiangyong Zhu, Bo Zheng, Kaifu Zhang

arXiv: 2412.15265v2 (cs.CL)

License: CC BY 4.0

Abstract: With the rapid advancement of Large Language Models (LLMs), significant safety concerns have emerged. Fundamentally, the safety of large language models is closely linked to the accuracy, comprehensiveness, and clarity of their understanding of safety knowledge, particularly in domains such as law, policy and ethics. This factuality ability is crucial in determining whether these models can be deployed and applied safely and compliantly within specific regions. To address these challenges and better evaluate the factuality ability of LLMs to answer short questions, we introduce the Chinese SafetyQA benchmark. Chinese SafetyQA has several properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate, Safety-related, Harmless). Based on Chinese SafetyQA, we perform a comprehensive evaluation on the factuality abilities of existing LLMs and analyze how these capabilities relate to LLM abilities, e.g., RAG ability and robustness against attacks.

Submitted to arXiv on 17 Dec. 2024

Results of the summarizing process for the arXiv paper: 2412.15265v2

Comprehensive Summary
Key points
Layman's Summary
Blog article

As Large Language Models (LLMs) advance, concerns about their safety have become increasingly prominent. How accurately, comprehensively, and clearly an LLM understands safety knowledge, particularly in domains like law, policy, and ethics, determines whether it can be deployed safely and compliantly within a specific region. To assess the factuality of LLMs when answering short questions in this area, the authors introduce the Chinese SafetyQA benchmark, which they characterize as Chinese, diverse, high-quality, static, easy to evaluate, safety-related, and harmless.

The analysis breaks results down by subtopic within the Chinese safety domain. o1-preview performed best across all categories, while gpt-4o-mini had the lowest performance. GPT models were stronger on Physical & Mental Health topics, which the analysis attributes to more training on international ESG issues. Non-Chinese models struggled on Illegal & Regulatory Compliance topics compared with Chinese models such as the Qwen series and Doubao, whose better performance is attributed to specialized training on Chinese legal knowledge. All Chinese models performed poorly on Safety Theoretical Knowledge topics, indicating a lack of understanding in areas such as network safety and information safety.

Related work has evaluated LLM factuality with short-form QA benchmarks such as SimpleQA and Chinese SimpleQA, which emphasize ease of evaluation, and has sought to improve factuality through methods like self-reflection and RAG.

In conclusion, Chinese SafetyQA advances the assessment of LLM factuality within the Chinese context. It fills a gap in existing safety benchmarks by focusing on compliance and legality evaluations for regions like China, and it contributes to our understanding of how well LLMs handle safety-related knowledge.
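The per-subtopic comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the record layout `(model, subtopic, is_correct)` and both function names are assumptions made for this example, and any data passed in would have to come from actual graded model outputs.

```python
from collections import defaultdict


def accuracy_by_subtopic(records):
    """Aggregate graded answers into accuracy per (model, subtopic).

    `records` is an iterable of (model, subtopic, is_correct) tuples.
    This layout is hypothetical; it is not taken from the paper.
    """
    totals = defaultdict(lambda: [0, 0])  # (model, subtopic) -> [n_correct, n_total]
    for model, subtopic, is_correct in records:
        cell = totals[(model, subtopic)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in totals.items()}


def best_model_per_subtopic(accuracy):
    """Return {subtopic: (model, accuracy)} for the top-scoring model in each subtopic."""
    best = {}
    for (model, subtopic), score in accuracy.items():
        if subtopic not in best or score > best[subtopic][1]:
            best[subtopic] = (model, score)
    return best
```

Grouping by subtopic rather than reporting a single overall score is what makes patterns like the ones summarized above visible, for example a model that is strong on Physical & Mental Health but weak on Illegal & Regulatory Compliance.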

Summary
- People are worried about how safe big language models are, especially in areas like law and ethics.
- A test called Chinese SafetyQA was made to see how well these models can answer short questions about safety.
- One model called o1-preview did the best overall in the safety test, while another model called gpt-4o-mini did the worst.
- GPT models are good at health topics but do worse on questions about laws and regulations than Chinese models such as the Qwen series and Doubao.
- Chinese models don't do well on safety theory topics like network safety.

Definitions
- Safety: Being free from harm or danger.
- Language Models: Computer programs that can understand and generate human language.
- Factuality: The quality of being true or based on facts.
- Benchmark: A standard or point of reference used for comparison or evaluation.

Large Language Models (LLMs) have become increasingly prevalent in today's digital landscape, with their ability to generate human-like text and answer complex questions. However, concerns about the safety of these models have also risen, particularly in domains such as law, policy, and ethics. Ensuring that LLMs understand safety knowledge accurately and comprehensively is crucial for their safe deployment and compliance within specific regions.

To address these challenges, a team of researchers introduced the Chinese SafetyQA benchmark. The benchmark is characterized as Chinese, diverse, high-quality, static, easy to evaluate, safety-related, and harmless. It aims to assess the factuality of LLMs when answering short questions on safety topics.

The paper analyzes results across different subtopics within the Chinese safety domain. o1-preview performed best across all categories, while gpt-4o-mini had the lowest performance. GPT models were more proficient in Physical & Mental Health topics, which the analysis attributes to their training on international ESG issues. On the other hand, non-Chinese models struggled in Illegal & Regulatory Compliance topics compared with Chinese models such as the Qwen series and Doubao, whose better performance is attributed to specialized training on Chinese legal knowledge. Moreover, all Chinese models performed poorly on Safety Theoretical Knowledge topics, indicating a lack of understanding in areas such as network safety and information safety. This points to a significant gap in current LLM capabilities on the theoretical side of safety.

Previous work has evaluated LLM factuality through benchmarks like SimpleQA and Chinese SimpleQA, which emphasize ease of evaluation. Efforts have also been made to improve LLM factuality through methods like self-reflection and RAG (Retrieval-Augmented Generation). Chinese SafetyQA extends this line of work to safety knowledge in the Chinese context. It fills a gap in existing safety benchmarks by focusing on compliance and legality evaluations for regions like China. It provides a standardized evaluation method and also highlights the need for further research and development in this area.

As LLMs continue to evolve and become more integrated into various industries, it is essential to ensure their factual accuracy on sensitive topics such as safety. The Chinese SafetyQA benchmark is a step toward that goal and toward the responsible use of LLMs.
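To make "easy to evaluate" concrete, here is a rough sketch of how short-form answers are typically scored in SimpleQA-style benchmarks. It assumes each answer has already been graded as correct, incorrect, or not attempted, and it uses the F-score defined for SimpleQA (the harmonic mean of overall accuracy and accuracy on attempted questions). Whether Chinese SafetyQA reports exactly these metrics is an assumption here, since the summary above does not spell them out; the label strings and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical label names for a three-way, SimpleQA-style grading scheme.
GRADES = ("correct", "incorrect", "not_attempted")


def short_form_metrics(grades):
    """Compute summary metrics from a list of per-question grade labels."""
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")
    total = len(grades)
    if total == 0:
        raise ValueError("no graded answers")

    correct = counts["correct"] / total
    attempted = counts["correct"] + counts["incorrect"]
    # Accuracy restricted to questions the model actually tried to answer.
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    denom = correct + correct_given_attempted
    # Harmonic mean of the two accuracies; 0.0 when nothing was correct.
    f_score = 2 * correct * correct_given_attempted / denom if denom else 0.0

    return {
        "correct": correct,
        "incorrect": counts["incorrect"] / total,
        "not_attempted": counts["not_attempted"] / total,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }
```

Because every question has a single short reference answer, grading reduces to one label per question, and the whole benchmark reduces to a handful of ratios like these.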

Created on 25 Feb. 2025
