A Closer Look at Weakly-Supervised Audio-Visual Source Localization
AI-generated Key Points
- Authors Shentong Mo and Pedro Morgado focus on predicting the location of visual sound sources in videos.
- Collecting ground-truth annotations of sounding objects is costly, which has motivated weakly-supervised localization methods that learn from datasets without bounding-box annotations.
- Popular evaluation protocols have two major flaws: they allow early stopping on a fully annotated dataset, which increases the annotation effort required for training, and their metrics assume a sound source is always present.
- The authors propose an extension to benchmarks like Flickr SoundNet and VGG-Sound Sources by including negative samples in the test set.
- New metrics are introduced to balance localization accuracy and recall for a more comprehensive evaluation of prior methods.
- Under the new protocol, most prior methods fail to identify negatives and overfit significantly, relying heavily on early stopping to reach their best results.
- Mo and Morgado present a new approach that uses extreme visual dropout and momentum encoders to combat overfitting, establishing a new state of the art on both Flickr SoundNet and VGG-Sound Source.
- Code and pre-trained models are available on GitHub (https://github.com/stoneMo/SLAVC).
- This study emphasizes the importance of refining evaluation protocols in weakly-supervised audio-visual source localization research.
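To make the idea of a metric that balances localization accuracy and recall on a test set with negatives more concrete, here is a minimal sketch. It is not the paper's metric definition: the `Prediction` type, the confidence threshold, the IoU threshold of 0.5, and the use of an F1 score are all assumptions chosen for illustration. A prediction counts as correct only if the model claims a visible source and localizes it well enough, so a model that always claims a source is penalized on negative samples.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    """Hypothetical per-sample output, used only for this illustration."""
    confidence: float      # model's score that a visible sound source exists
    iou: Optional[float]   # overlap with the ground truth; None for a negative sample


def f1_with_negatives(preds: List[Prediction],
                      conf_thresh: float,
                      iou_thresh: float = 0.5) -> float:
    """F1 score over a test set that mixes positive and negative samples."""
    n_fired = 0      # samples where the model claims a visible source
    n_positive = 0   # samples that really contain a visible source
    n_correct = 0    # claimed, present, and localized well enough

    for p in preds:
        fired = p.confidence >= conf_thresh
        if fired:
            n_fired += 1
        if p.iou is not None:
            n_positive += 1
            if fired and p.iou >= iou_thresh:
                n_correct += 1

    precision = n_correct / n_fired if n_fired else 0.0
    recall = n_correct / n_positive if n_positive else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    samples = [
        Prediction(0.9, 0.8),   # confident and well localized
        Prediction(0.8, 0.3),   # confident but poorly localized
        Prediction(0.2, 0.9),   # source present but model stays silent
        Prediction(0.7, None),  # negative sample, model wrongly fires
        Prediction(0.1, None),  # negative sample, model correctly silent
    ]
    print(f1_with_negatives(samples, conf_thresh=0.5))
```

Under this sketch, firing on a negative sample lowers precision, while staying silent on a positive sample lowers recall, which is the kind of trade-off the key points describe.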
Authors: Shentong Mo, Pedro Morgado
Abstract: Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Since collecting ground-truth annotations of sounding objects can be costly, a plethora of weakly-supervised localization methods that can learn from datasets with no bounding-box annotations have been proposed in recent years, by leveraging the natural co-occurrence of audio and visual signals. Despite significant interest, popular evaluation protocols have two major flaws. First, they allow for the use of a fully annotated dataset to perform early stopping, thus significantly increasing the annotation effort required for training. Second, current evaluation metrics assume the presence of sound sources at all times. This is of course an unrealistic assumption, and thus better metrics are necessary to capture the model's performance on (negative) samples with no visible sound sources. To accomplish this, we extend the test set of popular benchmarks, Flickr SoundNet and VGG-Sound Sources, in order to include negative samples, and measure performance using metrics that balance localization accuracy and recall. Using the new protocol, we conducted an extensive evaluation of prior methods, and found that most prior works are not capable of identifying negatives and suffer from significant overfitting problems (rely heavily on early stopping for best results). We also propose a new approach for visual sound source localization that addresses both these problems. In particular, we found that, through extreme visual dropout and the use of momentum encoders, the proposed approach combats overfitting effectively, and establishes a new state-of-the-art performance on both Flickr SoundNet and VGG-Sound Source. Code and pre-trained models are available at https://github.com/stoneMo/SLAVC.
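The abstract names two ingredients for combating overfitting, extreme visual dropout and momentum encoders, without giving details. The sketch below shows the generic form of each technique under stated assumptions; it is not taken from the authors' code. The function names, the momentum value of 0.99, the drop probability of 0.9, and the use of flat lists of floats in place of real network weights and feature maps are all hypothetical.

```python
import random
from typing import List


def ema_update(target: List[float],
               online: List[float],
               momentum: float = 0.99) -> List[float]:
    """Momentum-encoder update: the target weights slowly track the online
    weights through an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


def drop_visual_features(features: List[float],
                         drop_prob: float,
                         rng: random.Random) -> List[float]:
    """Inverted dropout: zero each feature with probability drop_prob and
    rescale the survivors so the expected value is unchanged."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [f / keep_prob if rng.random() < keep_prob else 0.0 for f in features]


if __name__ == "__main__":
    rng = random.Random(0)

    # Assumed "extreme" setting: most visual features are dropped.
    visual = [1.0] * 10
    print(drop_visual_features(visual, drop_prob=0.9, rng=rng))

    # The target encoder moves only a small step toward the online encoder.
    target_weights = [1.0, 0.0]
    online_weights = [0.0, 1.0]
    print(ema_update(target_weights, online_weights, momentum=0.9))
```

In a real training loop the update would run over the parameters of two networks after each optimizer step, and the dropout would be applied to visual feature maps; the actual rates and placement used by the authors should be checked against the linked repository.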