SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
AI-generated Key Points
- Aligning large generative models with human feedback is a critical challenge. In speech synthesis it is especially pronounced because no large-scale human preference dataset exists.
- SpeechJudge addresses this gap with a suite comprising a dataset, a benchmark, and a reward model, all centered on naturalness, one of the most fundamental subjective metrics in speech synthesis.
- SpeechJudge-Data is a large-scale human feedback corpus of 99K speech pairs, each annotated for intelligibility and naturalness preference. The pairs were generated with a diverse set of advanced zero-shot text-to-speech (TTS) models across varied speech styles and multiple languages.
- SpeechJudge-Eval, built from this corpus, is a challenging benchmark for speech naturalness judgment. Existing metrics and AudioLLMs struggle on it: the leading model, Gemini-2.5-Flash, agrees with human judgment less than 70% of the time.
- SpeechJudge-GRM is a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data in two post-training stages: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases.
- On SpeechJudge-Eval, SpeechJudge-GRM reaches 77.2% accuracy (79.4% after inference-time scaling @10), compared with 72.7% for a classic Bradley-Terry reward model.
- SpeechJudge-GRM can also serve as a reward function during the post-training of speech generation models, helping align them with human preferences.
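The accuracy figures in the key points are agreement rates between a judge's pairwise choice and the human preference label. The sketch below shows how such a rate could be computed, and one plausible reading of "inference-time scaling @10". It assumes that @10 means sampling ten judgments per pair and taking a majority vote; the summary does not say how the samples are aggregated, so that part is an assumption. The function names and the "A"/"B" labels are hypothetical and used only for illustration.

```python
from collections import Counter
from typing import Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most frequent vote; ties go to the label seen first."""
    if not votes:
        raise ValueError("votes must be non-empty")
    counts = Counter(votes)
    best = max(counts.values())
    for vote in votes:
        if counts[vote] == best:
            return vote
    raise AssertionError("unreachable")


def agreement(predictions: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of pairs where the judge picks the same clip as the annotators."""
    if len(predictions) != len(human_labels):
        raise ValueError("predictions and human_labels must have equal length")
    if not human_labels:
        raise ValueError("at least one labeled pair is required")
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(human_labels)


# Hypothetical data: three speech pairs, each judged several times.
# ASSUMPTION: repeated samples are aggregated by majority vote.
sampled_judgments = [
    ["A", "A", "B", "A"],
    ["B", "B", "B", "A"],
    ["A", "B", "B", "B"],
]
human_labels = ["A", "B", "A"]

voted = [majority_vote(samples) for samples in sampled_judgments]
score = agreement(voted, human_labels)
```

With a real judge, each inner list would hold the preferences parsed from ten independently sampled model outputs for the same speech pair.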
Authors: Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu
Abstract: Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
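For readers unfamiliar with the Bradley-Terry baseline named in the abstract, the following minimal sketch shows the standard Bradley-Terry preference model: each item gets a scalar score, the probability that one item is preferred is the sigmoid of the score difference, and training minimizes the negative log of that probability. This is the textbook formulation, not the paper's implementation; the function names are hypothetical, and how the authors compute scores from audio is not described in this summary.

```python
import math


def bt_preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B."""
    diff = score_a - score_b
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def bt_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human choice: -log sigmoid(r_w - r_l)."""
    x = score_rejected - score_preferred
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical scalar rewards for two synthesized clips of the same text.
p_first_is_better = bt_preference_prob(1.3, 0.4)
loss_if_humans_chose_first = bt_loss(1.3, 0.4)
```

The loss shrinks as the preferred clip's score rises above the rejected clip's score, which is what pushes a scalar reward model toward ranking pairs the way annotators did.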