VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

AI-generated keywords: VoiceCraft, token infilling, neural codec, speech editing, zero-shot text-to-speech

AI-generated Key Points

  • VoiceCraft is a token infilling neural codec language model that achieves state-of-the-art performance on speech editing and zero-shot text-to-speech (TTS).
  • Developed by Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, and David Harwath, it uses a Transformer decoder architecture and a token rearrangement procedure that combines causal masking and delayed stacking.
  • Produces edited speech that human evaluators find nearly indistinguishable from unedited recordings in terms of naturalness.
  • Outperforms prior state-of-the-art zero-shot TTS models, including VALLE and XTTS-v2, on data with diverse accents, speaking styles, recording conditions, background noise, and music.
  • Introduces the RealEdit dataset for speech editing evaluation under challenging and realistic conditions.
  • Reports strong results on objective metrics such as speaker similarity as well as on subjective human evaluation.
  • A slight gap in naturalness remains relative to ground truth recordings, for example on YouTube utterances, but overall performance is highly competitive.

Authors: Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath

Data, code, and model weights are available at https://github.com/jasonppy/VoiceCraft
License: CC BY-NC-SA 4.0

Abstract: We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS, our model outperforms prior SotA models including VALLE and the popular commercial model XTTS-v2. Crucially, the models are evaluated on challenging and realistic datasets, that consist of diverse accents, speaking styles, recording conditions, and background noise and music, and our model performs consistently well compared to other models and real recordings. In particular, for speech editing evaluation, we introduce a high quality, challenging, and realistic dataset named RealEdit. We encourage readers to listen to the demos at https://jasonppy.github.io/VoiceCraft_web.
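
The token rearrangement mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the details, so the code below assumes that causal masking moves each span to be generated to the end of the sequence (leaving a placeholder behind), and that delayed stacking shifts codebook k by k time steps. The function names, the `<M0>`-style mask tokens, and the padding value are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def causal_mask_rearrange(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> List[str]:
    """Move each [start, end) span to the end of the sequence.

    Assumption: a placeholder mask token is left where the span was, and the
    same mask token precedes the moved span, so a left-to-right decoder sees
    the context on both sides of the gap before generating the span itself.
    """
    kept: List[str] = []
    moved: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        mask = f"<M{i}>"  # hypothetical mask-token naming
        kept.extend(tokens[cursor:start])
        kept.append(mask)
        moved.append(mask)
        moved.extend(tokens[start:end])
        cursor = end
    kept.extend(tokens[cursor:])
    return kept + moved


def delayed_stack(frames: Sequence[Sequence[int]], pad: int = -1) -> List[List[int]]:
    """Shift codebook k by k steps so that step t holds frames[t - k][k].

    Assumption: each frame carries one token per codebook, and positions
    with no source token are filled with `pad`.
    """
    if not frames:
        return []
    num_codebooks = len(frames[0])
    num_steps = len(frames) + num_codebooks - 1
    stacked: List[List[int]] = []
    for t in range(num_steps):
        step = []
        for k in range(num_codebooks):
            src = t - k
            step.append(frames[src][k] if 0 <= src < len(frames) else pad)
        stacked.append(step)
    return stacked


if __name__ == "__main__":
    # Span covering positions 2 and 3 is moved to the end.
    print(causal_mask_rearrange(["a", "b", "c", "d", "e", "f"], [(2, 4)]))
    # Three frames, two codebooks: the second codebook lags by one step.
    print(delayed_stack([[1, 10], [2, 20], [3, 30]]))
```

In this toy setting, the span to be edited ends up after all of the surrounding context, which is what allows an autoregressive decoder to generate within an existing sequence. The delay means the tokens of one frame are spread over consecutive steps instead of being predicted all at once.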

Submitted to arXiv on 25 Mar. 2024


Results of the summarizing process for the arXiv paper: 2403.16973v1

VoiceCraft is a token infilling neural codec language model that achieves state-of-the-art performance in both speech editing and zero-shot text-to-speech (TTS) on audio content such as audiobooks, internet videos, and podcasts. Developed by Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, and David Harwath, VoiceCraft uses a Transformer decoder architecture along with a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as judged in human evaluations. For zero-shot TTS, VoiceCraft outperforms previous state-of-the-art models, including VALLE and XTTS-v2. The evaluation data covers diverse accents, speaking styles, recording conditions, background noise, and music. One notable aspect of the evaluation is the introduction of a challenging and realistic dataset called RealEdit for speech editing. The authors encourage readers to listen to the demos at https://jasonppy.github.io/VoiceCraft_web. Furthermore, the detailed results show that VoiceCraft scores well on both objective metrics, such as speaker similarity, and subjective human evaluation metrics. The model performs strongly on intelligibility and speaker similarity when compared with ground truth data. A slight gap in naturalness remains between VoiceCraft-generated speech and ground truth recordings, for example on YouTube utterances, but overall performance is highly competitive.
Overall, VoiceCraft is a notable advance in speech processing, delivering strong results on both speech editing and zero-shot TTS in real-world recording conditions.
Created on 02 Apr. 2024
