FLAM: Frame-Wise Language-Audio Modeling

AI-generated keywords: FLAM, frame-wise audio understanding, open-vocabulary contrastive audio-language model, fine-grained and interpretable audio understanding, multimodal learning

AI-generated Key Points

  • FLAM (Frame-Wise Language-Audio Modeling) is designed to address limitations of existing multi-modal audio-language models in frame-wise audio understanding.
  • Unlike traditional sound event detection models, which are limited to pre-defined categories, FLAM localizes specific sound events in an open-vocabulary setting.
  • The model employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations during training.
  • FLAM leverages a large-scale dataset with diverse audio events, LLM-generated captions, and simulation techniques for frame-wise supervision.
  • Experimental results show that FLAM significantly enhances open-vocabulary localization capabilities while maintaining strong performance in global retrieval and downstream tasks.
  • Frame-wise audio-language alignment supports more precise understanding of complex audio through natural language queries, indicating when a described sound occurs rather than only whether it is present.

Authors: Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon

Accepted at ICML 2025
License: CC BY 4.0

Abstract: Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
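
The abstract names a frame-wise objective with logit adjustment but does not spell out its form. As a rough illustration only, the sketch below assumes a per-frame binary (sigmoid) objective in which each frame embedding is scored against a caption embedding, and a bias derived from an assumed event prior is added to the logit. The function names, the `scale` and `prior` parameters, and the choice of `log(prior / (1 - prior))` as the adjustment are all assumptions of this sketch, not details taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def frame_logits(frame_embs, text_emb, scale, prior):
    """Score every audio frame against one caption embedding.

    ASSUMPTION: the logit adjustment is a bias log(prior / (1 - prior)),
    where `prior` is a hypothetical estimate of how often the event is
    active in a frame. Rare events get a negative bias.
    """
    bias = math.log(prior / (1.0 - prior))
    return [scale * dot(f, text_emb) + bias for f in frame_embs]


def frame_bce_loss(logits, labels):
    """Mean binary cross-entropy over frames (labels are 0 or 1)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Loss is softplus(-z) for a positive frame, softplus(z) for a negative.
        s = z if y == 1 else -z
        # Numerically stable softplus(-s).
        total += max(-s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total / len(logits)


# Toy example: three frames, the event is active in frames 0 and 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
caption = [1.0, 0.0]
labels = [1, 0, 1]

logits = frame_logits(frames, caption, scale=5.0, prior=0.1)
loss = frame_bce_loss(logits, labels)
```

Because each (frame, caption) pair is scored independently, a loss of this shape does not need a softmax over the whole batch, which is one plausible reading of "memory-efficient"; the summary itself does not confirm the mechanism.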

Submitted to arXiv on 08 May 2025


Results of the summarizing process for the arXiv paper: 2505.05335v1

FLAM (Frame-Wise Language-Audio Modeling) is designed to address the limitations of existing multi-modal audio-language models (ALMs) in frame-wise audio understanding. While these models excel at text-audio retrieval, they struggle to pinpoint when specific sound events occur. Traditional sound event detection models can localize events precisely, but they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events.

In response to these challenges, FLAM introduces an open-vocabulary contrastive audio-language model that can localize specific sound events. It employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances, during training. To enable frame-wise supervision, the model leverages a large-scale dataset with diverse audio events, LLM-generated captions, and simulation techniques. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization while maintaining strong performance in global retrieval and downstream tasks.

Frame-wise audio-language alignment of this kind allows more precise understanding of complex audio through natural language queries, which could benefit applications such as content indexing, accessibility, and multimedia retrieval. While there are no significant ethical risks associated with FLAM, the researchers emphasize responsible use of the model in real-world scenarios. The project acknowledges Yuanbo Hou and Samuel Lavoie, who provided insightful discussions and advice throughout its development.
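
To make "localizing a sound event from a text query" concrete, the following minimal sketch turns a sequence of per-frame probabilities for one query into time intervals. It is an illustration under stated assumptions: the `localize` helper, the 0.5 threshold, and the frame hop in seconds are hypothetical choices, and the summary does not state which decision rule or frame rate FLAM uses.

```python
def localize(frame_probs, threshold=0.5, hop_seconds=0.1):
    """Group consecutive frames at or above `threshold` into intervals.

    Returns a list of (start_seconds, end_seconds) tuples.
    ASSUMPTION: frames are evenly spaced `hop_seconds` apart; both
    default values are placeholders chosen for this sketch.
    """
    intervals = []
    start = None  # index of the first frame of the currently open interval
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            intervals.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        # The event is still active at the end of the clip.
        intervals.append((start * hop_seconds, len(frame_probs) * hop_seconds))
    return intervals


# Hypothetical per-frame probabilities for a query such as "dog barking".
probs = [0.1, 0.8, 0.9, 0.2, 0.7]
events = localize(probs, threshold=0.5, hop_seconds=0.5)
```

A calibrated frame-wise objective matters here because a single fixed threshold only behaves sensibly across different queries if the probabilities are comparable from one query to the next.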
Created on 22 Oct. 2025

