Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- Introduces a new approach to any-to-many voice conversion using location-relative, sequence-to-sequence (seq2seq) modeling
- Uses text supervision during training and combines a bottle-neck feature extractor (BNE) with a seq2seq synthesis module
- Trains an encoder-decoder hybrid connectionist temporal classification and attention (CTC-attention) phoneme recognizer whose encoder contains a bottle-neck layer
- Uses the BNE, obtained from the phoneme recognizer, to extract speaker-independent, dense, and rich spoken-content representations from spectral features
- Trains a multi-speaker seq2seq synthesis model with location-relative attention to reconstruct spectral features from the bottle-neck features, conditioned on speaker representations that control speaker identity in the generated speech
- Down-samples the input spectral features along the temporal dimension to ease the difficulty of aligning long sequences with seq2seq models
- Equips the synthesis model with a discretized mixture-of-logistics (MoL) attention mechanism
- Supports any-to-many voice conversion because the phoneme recognizer is trained on a large speech recognition corpus
- Outperforms existing methods in terms of both naturalness and speaker similarity according to objective and subjective evaluations
- Feature selection and model design strategies are effective in this approach according to ablation studies
- The proposed VC approach can be extended to any-to-any VC (also known as one/few-shot VC) and still achieves high performance in objective and subjective evaluations
- May benefit applications such as virtual assistants or personalized voice services
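The temporal down-sampling mentioned in the key points can be illustrated with a minimal sketch. The summary does not say how the paper down-samples, so the window averaging used here, the function name `downsample_frames`, and the factor of 4 are all assumptions made for illustration; the actual model may use a different scheme, such as strided convolutions.

```python
def downsample_frames(frames, factor):
    """Average each run of `factor` consecutive frames into one frame.

    `frames` is a list of equal-length feature vectors (one per time step).
    Window averaging is an assumed, illustrative scheme, not the paper's.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    reduced = []
    for start in range(0, len(frames), factor):
        chunk = frames[start:start + factor]  # last chunk may be shorter
        dim = len(chunk[0])
        reduced.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return reduced


# Toy input: 8 frames of 2-dimensional "spectral" features.
frames = [[float(t), float(2 * t)] for t in range(8)]
reduced = downsample_frames(frames, 4)  # 8 frames become 2
```

Shortening the input by a factor of 4 means the attention mechanism has to align a sequence a quarter as long, which is the motivation the summary gives for down-sampling.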
Authors: Songxiang Liu, Yuewen Cao, Disong Wang, Xixin Wu, Xunying Liu, Helen Meng
Abstract: This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach, which utilizes text supervision during training. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. During the training stage, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense and rich spoken content representations from spectral features. Then a multi-speaker location-relative attention based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioning on speaker representations for speaker identity control in the generated speech. To mitigate the difficulties of using seq2seq models to align long sequences, we down-sample the input spectral feature along the temporal dimension and equip the synthesis model with a discretized mixture of logistic (MoL) attention mechanism. Since the phoneme recognizer is trained with large speech recognition data corpus, the proposed approach can conduct any-to-many voice conversion. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity. Ablation studies are conducted to confirm the effectiveness of feature selection and model design strategies in the proposed approach. The proposed VC approach can readily be extended to support any-to-any VC (also known as one/few-shot VC), and achieve high performance according to objective and subjective evaluations.
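The discretized MoL attention named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the common formulation in which the weight on encoder position j is the mass that a mixture of logistic distributions places on the interval [j - 0.5, j + 0.5], and in which each component mean only moves forward between decoder steps. The function names, the fixed step sizes, and the toy parameters are hypothetical; in a trained model the mixture weights, scales, and steps would be predicted by the decoder.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function (the logistic CDF)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mol_attention_weights(weights, means, scales, num_positions):
    """Discretized mixture-of-logistics attention over encoder positions.

    alpha[j] = sum_k w_k * (F((j + 0.5 - mu_k) / s_k) - F((j - 0.5 - mu_k) / s_k))
    where F is the logistic CDF.
    """
    alpha = []
    for j in range(num_positions):
        a = 0.0
        for w, mu, s in zip(weights, means, scales):
            a += w * (sigmoid((j + 0.5 - mu) / s) - sigmoid((j - 0.5 - mu) / s))
        alpha.append(a)
    return alpha


def advance_means(means, deltas):
    """Location-relative update: each mean moves forward by a non-negative step."""
    return [mu + max(0.0, d) for mu, d in zip(means, deltas)]


# Toy run: two mixture components, ten encoder positions, three decoder steps.
weights, scales = [0.5, 0.5], [1.0, 2.0]
means = [0.0, 0.0]
peaks, totals = [], []
for _ in range(3):
    alpha = mol_attention_weights(weights, means, scales, 10)
    peaks.append(max(range(len(alpha)), key=alpha.__getitem__))
    totals.append(sum(alpha))
    means = advance_means(means, [2.0, 2.0])  # hypothetical fixed step size
```

Because the means can only move forward, the attention peak advances monotonically along the input, which is one reason this family of attention mechanisms is considered robust for long sequences. The weights need not sum to one, since some of the mixture mass can fall outside the encoder range.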