CDPAM: Contrastive learning for perceptual audio similarity

AI-generated keywords: CDPAM, Contrastive Learning, Perceptual Audio Similarity, Speech Processing, Deep Learning

AI-generated Key Points

  • CDPAM is a contrastive learning approach for perceptual audio similarity
  • It combines contrastive learning and multi-dimensional representations to build robust models from limited data
  • Human judgments on triplet comparisons are collected to improve generalization to a broader range of audio perturbations
  • CDPAM correlates well with human responses across nine varied datasets
  • It is applied to existing speech synthesis and enhancement methods, including a speech denoiser based on the DEMUCS architecture
  • Evaluation is performed on 500 randomly selected audio clips from the VCTK test set using objective measures (PESQ, STOI, CSIG) and subjective measures
  • Adding CDPAM as a metric significantly improves the performance of existing speech synthesis and enhancement methods in both objective and subjective tests
  • CDPAM captures perceptual audio similarity effectively and enhances various speech processing tasks
  • It contributes a robust metric for deep learning-based speech processing that correlates well with human perception across different datasets
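
The claim that a metric "correlates well with human responses" can be made concrete with a rank correlation. The sketch below computes Spearman's rank correlation between metric distances and listener ratings in plain Python. The choice of Spearman correlation and all numbers are assumptions for illustration; they are not taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: a larger metric distance should mean a lower rating.
metric_distance = [0.10, 0.35, 0.20, 0.80, 0.55]
human_rating = [4.8, 3.9, 4.1, 1.5, 2.7]
rho = spearman(metric_distance, human_rating)
print(rho)
```

In this toy data the two orderings are exactly reversed, so the correlation is strongly negative: clips the metric places farther from the reference are the ones listeners rate lower.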

Authors: Pranay Manocha, Zeyu Jin, Richard Zhang, Adam Finkelstein

Dataset, code and sound examples can be found at https://github.com/pranaymanocha/PerceptualAudio/tree/master/cdpam
License: CC BY 4.0

Abstract: Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM, a metric that builds on and advances DPAM. The primary improvement is to combine contrastive learning and multi-dimensional representations to build robust models from limited data. In addition, we collect human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations. CDPAM correlates well with human responses across nine varied datasets. We also show that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.
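
To illustrate how judgments on triplet comparisons can supervise a distance metric, here is a minimal sketch of a generic triplet margin loss over embedding vectors. The abstract does not spell out the loss CDPAM uses, so the function names, the L2 distance, and the margin value are all assumptions made for illustration.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(ref, closer, farther, margin=0.5):
    """Generic triplet margin loss (an assumed formulation, not the paper's).

    `closer` is the clip listeners judged more similar to `ref`.
    The loss is zero once `closer` is nearer to `ref` than `farther`
    by at least `margin`.
    """
    d_pos = l2_distance(ref, closer)
    d_neg = l2_distance(ref, farther)
    return max(0.0, margin + d_pos - d_neg)


# Hypothetical 3-dimensional embeddings of a reference and two perturbed clips.
ref = [0.0, 1.0, 0.0]
closer = [0.0, 0.9, 0.1]
farther = [1.0, 0.0, 0.0]

loss_ok = triplet_margin_loss(ref, closer, farther)        # ordering agrees with listeners
loss_violated = triplet_margin_loss(ref, farther, closer)  # ordering contradicts listeners
print(loss_ok, loss_violated)
```

When the embedding already orders the two clips the way listeners did, the loss vanishes; when the order is reversed, the loss is positive and its gradient would push the embeddings toward the human ordering.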

Submitted to arXiv on 09 Feb. 2021


Results of the summarizing process for the arXiv paper: 2102.05109v1

The paper introduces CDPAM, a contrastive learning approach for perceptual audio similarity. CDPAM builds on and advances the existing DPAM metric by combining contrastive learning and multi-dimensional representations to build robust models from limited data. In addition, the authors collect human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations. CDPAM correlates well with human responses across nine varied datasets.

To demonstrate its effectiveness, the authors add it to existing speech synthesis and enhancement methods, including a speech denoiser based on the DEMUCS architecture. The evaluation is performed on 500 randomly selected audio clips from the VCTK test set using objective measures such as PESQ, STOI, and CSIG, as well as subjective measures. The results show that adding CDPAM as a metric significantly improves the performance of these methods in both objective and subjective tests. Overall, CDPAM contributes a robust metric for deep learning-based speech processing that correlates well with human perception across different datasets and improves existing techniques for synthesizing and enhancing speech.
Created on 07 Nov. 2023
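
As a rough illustration of what "adding a metric" to an existing method can mean, the sketch below forms a training objective as a base loss plus a weighted perceptual term. The L1 base loss, the weight, and `placeholder_metric` are assumptions for illustration; the placeholder is a simple mean squared difference and is not CDPAM.

```python
from typing import Callable, Sequence

Waveform = Sequence[float]


def l1_loss(output: Waveform, target: Waveform) -> float:
    """Mean absolute difference between two equal-length waveforms."""
    return sum(abs(o - t) for o, t in zip(output, target)) / len(target)


def placeholder_metric(output: Waveform, target: Waveform) -> float:
    """Stand-in for a learned perceptual metric (NOT CDPAM).

    Uses mean squared difference so the example stays self-contained.
    """
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


def total_loss(
    output: Waveform,
    target: Waveform,
    metric: Callable[[Waveform, Waveform], float],
    weight: float = 0.5,
) -> float:
    """Base loss plus a weighted perceptual-distance term."""
    return l1_loss(output, target) + weight * metric(output, target)


# Hypothetical clean target and enhanced output, four samples each.
clean = [0.0, 0.5, -0.5, 0.25]
enhanced = [0.0, 0.25, -0.5, 0.5]
loss = total_loss(enhanced, clean, placeholder_metric)
print(loss)
```

Passing the metric in as a callable keeps the base training code unchanged: swapping the placeholder for a differentiable learned metric would only change the argument, not the structure of the objective.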


