AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation

AI-generated keywords: Audio-visual segmentation

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

The Segment Anything Model (SAM) is effective in visual segmentation tasks.
AV-SAM is a framework that combines audio and visual information for audio-visual tasks.
AV-SAM leverages pixel-wise audio-visual fusion by aggregating cross-modal representations from audio and visual features.
The aggregated cross-modal features are used to generate final audio-visual segmentation masks.
AV-SAM achieves competitive performance in sounding object localization and segmentation.
Extensive experiments were conducted on the Flickr-SoundNet and AVSBench datasets.
AV-SAM provides accurate sounding object masks corresponding to the audio input.


Authors: Shentong Mo, Yapeng Tian

arXiv: 2305.01836v1 (cs.CV)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Segment Anything Model (SAM) has recently shown its powerful effectiveness in visual segmentation tasks. However, there is less exploration concerning how SAM works on audio-visual tasks, such as visual sound localization and segmentation. In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that can generate sounding object masks corresponding to the audio. Specifically, our AV-SAM simply leverages pixel-wise audio-visual fusion across audio features and visual features from the pre-trained image encoder in SAM to aggregate cross-modal representations. Then, the aggregated cross-modal features are fed into the prompt encoder and mask decoder to generate the final audio-visual segmentation masks. We conduct extensive experiments on Flickr-SoundNet and AVSBench datasets. The results demonstrate that the proposed AV-SAM can achieve competitive performance on sounding object localization and segmentation.

Submitted to arXiv on 03 May 2023


Results of the summarizing process for the arXiv paper: 2305.01836v1


Comprehensive Summary

The Segment Anything Model (SAM) has proven to be highly effective in visual segmentation tasks. To address the limited exploration of SAM's performance in audio-visual tasks such as visual sound localization and segmentation, the authors propose a framework called AV-SAM that combines audio and visual information to generate sounding object masks corresponding to the audio. AV-SAM leverages pixel-wise audio-visual fusion by aggregating cross-modal representations from audio features and visual features obtained from the pre-trained image encoder in SAM. These aggregated cross-modal features are then fed into the prompt encoder and mask decoder to generate final audio-visual segmentation masks. To evaluate the performance of AV-SAM, extensive experiments are conducted on two datasets: Flickr-SoundNet and AVSBench. The results demonstrate that AV-SAM achieves competitive performance in sounding object localization and segmentation. Overall, this work presents a simple yet effective approach for audio-visual localization and segmentation based on the Segment Anything Model. By leveraging cross-modal fusion of audio and visual features, AV-SAM provides accurate sounding object masks corresponding to the audio input.


Layman's Summary

- There is a model called SAM that is good at separating things in pictures.
- AV-SAM is a special way of using both sound and pictures to do tasks together.
- AV-SAM combines sound and pictures to make a final picture that shows where things are.
- AV-SAM works well at finding and separating objects that make sounds.
- People tested AV-SAM on different datasets and it was very accurate at showing where sounds come from.

Definitions

- Segment Anything Model (SAM): A model that can separate things in pictures.
- Audio-visual (AV) tasks: Tasks that use both sound and pictures together.
- Pixel-wise: Looking at each tiny part of a picture or sound separately.
- Fusion: Combining different things together to make something new.
- Cross-modal representations: Different ways of showing the same thing, like using both sound and pictures to show where something is.

Exploring the Segment Anything Model (SAM) for Audio-Visual Tasks

The Segment Anything Model (SAM) has been widely used in visual segmentation tasks, but its performance in audio-visual tasks such as visual sound localization and segmentation is largely unexplored. To address this gap, a new framework called AV-SAM was proposed to combine audio and visual information to generate sounding object masks corresponding to the audio input. This article will explore how AV-SAM works and discuss its performance on two datasets: Flickr-SoundNet and AVSBench.

How Does AV-SAM Work?

AV-SAM leverages pixel-wise audio-visual fusion by aggregating cross-modal representations from both audio features and visual features obtained from a pre-trained image encoder in SAM. These aggregated cross-modal features are then fed into the prompt encoder and mask decoder to generate final audio-visual segmentation masks.
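The fusion step described above can be sketched in a few lines. The summary is based only on the paper metadata, which does not specify the fusion operator, so this is a minimal illustration rather than the authors' method. It assumes a single clip-level audio embedding is linearly projected to the visual channel dimension and added at every spatial position. The function name `pixelwise_av_fusion`, the projection matrix `proj`, the audio dimension of 128, and the choice of addition are all hypothetical. The visual shape is chosen to resemble a 256-channel, 64x64 image embedding.

```python
import numpy as np

def pixelwise_av_fusion(visual_feats, audio_feat, proj):
    """Fuse one audio embedding into every spatial location of a feature map.

    visual_feats: (C, H, W) features from an image encoder
    audio_feat:   (D,) clip-level audio embedding
    proj:         (C, D) hypothetical linear map from audio to visual channels
    """
    # Project the audio embedding into the visual channel space: shape (C,)
    audio_in_visual_space = proj @ audio_feat
    # Broadcast over H and W so the same audio vector is added at each pixel.
    # Addition is an assumption; the source does not name the operator.
    return visual_feats + audio_in_visual_space[:, None, None]

rng = np.random.default_rng(0)
visual = rng.standard_normal((256, 64, 64))    # stand-in for image-encoder output
audio = rng.standard_normal(128)               # stand-in for an audio embedding
proj = rng.standard_normal((256, 128)) * 0.01  # stand-in for learned weights

fused = pixelwise_av_fusion(visual, audio, proj)
print(fused.shape)
```

In a full pipeline, a tensor like `fused` would take the place of the plain image embedding passed on to the prompt encoder and mask decoder. That wiring is not shown here because the metadata gives no detail on it.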

Performance Evaluation of AV-SAM

To evaluate the performance of AV-SAM, extensive experiments were conducted on two datasets: Flickr-SoundNet and AVSBench. The results demonstrate that AV-SAM achieves competitive performance in sounding object localization and segmentation compared with other state-of-the-art methods.
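The metadata does not say which metrics the authors report. As an illustration only, intersection over union (IoU) between a predicted mask and a ground-truth mask is a common way to score segmentation output, and a minimal version might look like the sketch below. The function name `mask_iou` and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; treating this as perfect agreement
        # is a convention assumed here, not taken from the paper.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt = np.array([[1, 0, 0],
               [0, 1, 1]])
print(mask_iou(pred, gt))  # 2 shared pixels out of 4 in the union: 0.5
```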

Conclusion

Overall, this work presents a simple yet effective approach for audio-visual localization and segmentation based on the Segment Anything Model (SAM). By leveraging cross-modal fusion of both audio features and visual features, AV-SAM provides accurate sounding object masks corresponding to the given input audio signal.

Created on 10 Nov. 2023


Similar papers summarized with our AI tools

- 76.4%: Segment Anything (cs.CV)
- 76.2%: Fast Segment Anything (cs.CV)
- 73.3%: Can SAM Count Anything? An Empirical Study on SAM Counting (cs.CV)
- 72.9%: SAM3D: Segment Anything in 3D Scenes (cs.CV)
- 66.8%: Role of Audio in Audio-Visual Video Summarization (cs.CV)
- 66.5%: SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Spee… (cs.CL)
- 64.9%: Image Segmentation Algorithms Overview (cs.CV)

