StreamVC: Real-Time Low-Latency Voice Conversion

AI-generated keywords: StreamVC, cutting-edge, voice conversion, real-time communication, low-latency

AI-generated Key Points

  • StreamVC is a streaming voice conversion solution
  • Matches the voice timbre of the target speech while preserving the content and prosody of the source speech
  • Generates the resulting waveform at low latency directly from the input signal, even on a mobile platform, which suits real-time communication scenarios
  • Builds on the architecture and training strategy of the SoundStream neural audio codec
  • Achieves lightweight, high-quality speech synthesis
  • Learns soft speech units causally, and supplies whitened fundamental frequency information to improve pitch stability without leaking source timbre information
  • Derives the cluster centroids for the HuBERT pseudo-labels from the LibriTTS train-clean-100 subset
  • Trained for 1.3 million steps with a batch size of 128
  • Evaluation dataset built to assess generalization, with source and target speakers that are unseen during training
  • Evaluated on pairs of source and target speech utterances drawn from datasets including LibriTTS and VCTK

Authors: Yang Yang, Yury Kartynnik, Yunpeng Li, Jiuqiang Tang, Xing Li, George Sung, Matthias Grundmann

Accepted to ICASSP 2024
License: CC BY 4.0

Abstract: We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.
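The abstract mentions supplying "whitened fundamental frequency information" but this page does not spell out the procedure. A minimal sketch of one plausible reading follows, assuming that whitening means normalizing the f0 contour to zero mean and unit variance over the voiced frames of an utterance. The function name `whiten_f0`, the voiced mask, and the `eps` parameter are hypothetical and not taken from the paper.

```python
import numpy as np

def whiten_f0(f0_hz, voiced, eps=1e-8):
    """Normalize an f0 contour to zero mean and unit variance.

    Assumption: statistics are computed over voiced frames only, and
    unvoiced frames are set to 0. This is an illustrative sketch, not
    the authors' implementation.
    """
    f0 = np.asarray(f0_hz, dtype=np.float64)
    mask = np.asarray(voiced, dtype=bool)
    out = np.zeros_like(f0)
    if not mask.any():
        return out  # no voiced frames, nothing to normalize
    mean = f0[mask].mean()
    std = f0[mask].std()
    out[mask] = (f0[mask] - mean) / (std + eps)
    return out

# Two made-up contours with the same shape but different pitch ranges.
low_voice = np.array([100.0, 110.0, 0.0, 120.0])
high_voice = np.array([200.0, 220.0, 0.0, 240.0])
voiced = np.array([True, True, False, True])

# After whitening, the speaker-dependent offset and scale are gone.
same_contour = np.allclose(whiten_f0(low_voice, voiced),
                           whiten_f0(high_voice, voiced))
```

The intuition is that absolute pitch level and range carry speaker identity, while the normalized contour mostly carries intonation. Under this assumption, whitening would let a model use pitch movement from the source without receiving the source speaker's pitch register.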

Submitted to arXiv on 05 Jan. 2024


Results of the summarizing process for the arXiv paper: 2401.03078v1

StreamVC is a streaming voice conversion solution that matches the voice timbre of a target speech while preserving the content and prosody of the source speech. Unlike previous approaches, it generates the resulting waveform at low latency directly from the input signal, even on a mobile platform, which makes it applicable to real-time communication scenarios such as calls and video conferencing, including use cases like voice anonymization. The design builds on the architecture and training strategy of the SoundStream neural audio codec to obtain lightweight, high-quality speech synthesis. The research highlights two points: soft speech units can be learned causally, and supplying whitened fundamental frequency information improves pitch stability without leaking source timbre information. The cluster centroids for the HuBERT pseudo-labels are derived from the LibriTTS train-clean-100 subset, and the training dataset consists of 555.15 hours of speech from 2311 speakers. The model is trained for 1.3 million steps with a batch size of 128. To evaluate generalization, an evaluation dataset is created in which both source and target speakers are unseen during training. Evaluations are conducted on pairs of source and target speech utterances selected from datasets including LibriTTS and VCTK, and StreamVC is compared against baseline models using a set of evaluation metrics. The authors present the results as promising for real-time, low-latency voice conversion.
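The summary refers to cluster centroids, HuBERT pseudo-labels, and soft speech units without showing how they relate. The sketch below illustrates the general pseudo-label idea under stated assumptions: each frame's feature vector is assigned the index of its nearest centroid, and an encoder is trained with cross-entropy to predict that index, so its softmax output can be read as a soft distribution over units. The tiny arrays stand in for real HuBERT features, and all function names are hypothetical. This is not the paper's implementation.

```python
import numpy as np

def nearest_centroid_labels(features, centroids):
    """Assign each frame (row) the index of its closest centroid."""
    diff = features[:, None, :] - centroids[None, :, :]  # (T, K, D)
    dist2 = (diff ** 2).sum(axis=-1)                     # (T, K)
    return dist2.argmin(axis=1)                          # (T,)

def soft_units(logits):
    """Softmax over units for each frame, computed stably."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the pseudo-labels."""
    probs = soft_units(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

# Made-up 2-D "features" and two centroids, standing in for HuBERT data.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
features = np.array([[1.0, 0.0], [9.0, 11.0]])
labels = nearest_centroid_labels(features, centroids)

# Hypothetical encoder outputs: one agrees with the labels, one does not.
good_logits = np.array([[5.0, 0.0], [0.0, 5.0]])
bad_logits = np.array([[0.0, 5.0], [5.0, 0.0]])
loss_good = cross_entropy(good_logits, labels)
loss_bad = cross_entropy(bad_logits, labels)
```

In a causal setting, the encoder producing the logits would only see past and current frames. That constraint is not modeled here.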
Created on 24 Oct. 2024

