Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling

AI-generated keywords: Voice Conversion, seq2seq, CTC-attention, MoL Attention, Feature Selection

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Introduces a new approach to any-to-many, non-parallel voice conversion (VC) using location-relative, sequence-to-sequence (seq2seq) modeling
  • Utilizes text supervision during training and combines a bottle-neck feature extractor (BNE) with a seq2seq synthesis module
  • Trains an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer whose encoder contains a bottle-neck layer
  • Obtains the BNE from the phoneme recognizer and uses it to extract speaker-independent, dense and rich spoken content representations from spectral features
  • Trains a multi-speaker, location-relative attention-based seq2seq synthesis model to reconstruct spectral features from the bottle-neck features, conditioned on speaker representations to control speaker identity in the generated speech
  • Down-samples the input spectral features along the temporal dimension to ease the difficulty of aligning long sequences with seq2seq models
  • Equips the synthesis model with a discretized mixture of logistic (MoL) attention mechanism
  • Supports any-to-many voice conversion because the phoneme recognizer is trained on a large speech recognition corpus
  • Shows superior naturalness and speaker similarity in objective and subjective evaluations
  • Ablation studies confirm the effectiveness of the feature selection and model design strategies
  • Can readily be extended to any-to-any VC (also known as one/few-shot VC), achieving high performance according to objective and subjective evaluations
  • Presents a solution for any-to-many voice conversion that could potentially improve applications such as virtual assistants or personalized voice services
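
The temporal down-sampling mentioned in the key points can be illustrated with a small sketch. The metadata does not say how the down-sampling is done, so averaging non-overlapping groups of frames is purely an assumption here, as are the function name `downsample_time` and the factor of 4. The point is only that a shorter input sequence leaves fewer positions for the attention to align.

```python
import numpy as np


def downsample_time(features, factor):
    """Shorten a (num_frames, feature_dim) array along the time axis.

    Hypothetical helper: averages each group of `factor` consecutive
    frames and drops any leftover frames at the end. This is one simple
    way to down-sample, not necessarily the one used in the paper.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    num_frames, feature_dim = features.shape
    num_groups = num_frames // factor
    usable = num_groups * factor
    grouped = features[:usable].reshape(num_groups, factor, feature_dim)
    return grouped.mean(axis=1)


# Example: 10 frames of 2-dimensional features, down-sampled by 4.
mel = np.arange(20.0).reshape(10, 2)
short = downsample_time(mel, 4)
print(short.shape)  # (2, 2): two averaged frames, last two inputs dropped
```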

Authors: Songxiang Liu, Yuewen Cao, Disong Wang, Xixin Wu, Xunying Liu, Helen Meng

Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

Abstract: This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach, which utilizes text supervision during training. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. During the training stage, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense and rich spoken content representations from spectral features. Then a multi-speaker location-relative attention based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioning on speaker representations for speaker identity control in the generated speech. To mitigate the difficulties of using seq2seq models to align long sequences, we down-sample the input spectral feature along the temporal dimension and equip the synthesis model with a discretized mixture of logistic (MoL) attention mechanism. Since the phoneme recognizer is trained with large speech recognition data corpus, the proposed approach can conduct any-to-many voice conversion. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity. Ablation studies are conducted to confirm the effectiveness of feature selection and model design strategies in the proposed approach. The proposed VC approach can readily be extended to support any-to-any VC (also known as one/few-shot VC), and achieve high performance according to objective and subjective evaluations.
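
To make the MoL attention in the abstract more concrete, here is a minimal NumPy sketch of one decoder step of a discretized mixture-of-logistics attention in its commonly described location-relative form. The abstract does not give the parameterization, so everything below is an assumption: the function name `mol_attention_step`, the softplus used to keep the mean shift positive, and the shape of the inputs (which a real model would predict from the decoder state). Each mixture component places attention mass on encoder position j equal to the logistic CDF at j + 0.5 minus the CDF at j - 0.5, and the means can only move forward.

```python
import numpy as np


def logistic_cdf(x):
    # Numerically stable sigmoid written via tanh.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def mol_attention_step(prev_means, raw_weights, raw_deltas, raw_log_scales,
                       num_encoder_steps):
    """One step of a discretized mixture-of-logistics attention (sketch).

    All four parameter arrays have shape (num_components,). The raw_*
    values stand in for network outputs; their names are hypothetical.
    Returns attention weights of shape (num_encoder_steps,) and the
    updated component means.
    """
    # Mixture weights via a softmax over components.
    w = np.exp(raw_weights - raw_weights.max())
    w = w / w.sum()
    # Softplus keeps each shift positive, so the means move monotonically
    # forward relative to their previous location.
    means = prev_means + np.logaddexp(0.0, raw_deltas)
    scales = np.exp(raw_log_scales)
    # Encoder positions as a column so they broadcast against components.
    j = np.arange(num_encoder_steps)[:, None]
    upper = logistic_cdf((j + 0.5 - means) / scales)
    lower = logistic_cdf((j - 0.5 - means) / scales)
    alpha = ((upper - lower) * w).sum(axis=1)
    return alpha, means


# Example with 5 mixture components over 40 encoder positions.
rng = np.random.default_rng(0)
alpha, means = mol_attention_step(
    np.zeros(5), rng.normal(size=5), rng.normal(size=5), rng.normal(size=5), 40
)
```

Because the weights depend only on the distance between encoder positions and the running means, the mechanism is location-relative: it does not compare decoder states with encoder content, which is one reason this family of attention is considered suitable for long sequences. The weights sum to at most one, since some probability mass can fall outside the encoder range.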

Submitted to arXiv on 06 Sep. 2020

Results of the summarizing process for the arXiv paper: 2009.02725v3

This paper introduces an approach to any-to-many, non-parallel voice conversion (VC) based on location-relative, sequence-to-sequence (seq2seq) modeling. The method uses text supervision during training and combines a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. During training, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder contains a bottle-neck layer. The BNE is obtained from this phoneme recognizer and is used to extract speaker-independent, dense and rich spoken content representations from spectral features. A multi-speaker, location-relative attention-based seq2seq synthesis model is then trained to reconstruct spectral features from the bottle-neck features, conditioned on speaker representations that control speaker identity in the generated speech. To ease the difficulty of aligning long sequences with seq2seq models, the input spectral features are down-sampled along the temporal dimension, and the synthesis model is equipped with a discretized mixture of logistic (MoL) attention mechanism. Because the phoneme recognizer is trained on a large speech recognition corpus, the approach can conduct any-to-many voice conversion. Objective and subjective evaluations show superior performance in terms of both naturalness and speaker similarity, and ablation studies confirm the effectiveness of the feature selection and model design strategies. The approach can also readily be extended to any-to-any VC (also known as one/few-shot VC), where it achieves high performance according to objective and subjective evaluations. Overall, the study presents a solution for any-to-many voice conversion that could potentially improve applications such as virtual assistants or personalized voice services.
Created on 18 May. 2023
