Spoken question answering for visual queries

AI-generated keywords: Spoken Question Answering, Visual Queries, Multi-Modal Model, Text-to-Speech, Interspeech 2025

AI-generated Key Points

  • The paper discusses the development of a system for spoken Visual Question Answering (SVQA) that integrates speech and image modalities.
  • The authors address the challenge of training and evaluating SVQA models due to the lack of datasets encompassing text, speech, and images.
  • They synthesized VQA datasets using two zero-shot Text-to-Speech (TTS) models to overcome this issue.
  • Initial findings indicate that a model trained solely on synthesized speech nearly reaches the performance of the upper-bound model trained on textual QAs.
  • The choice of TTS model was found to have only a minor impact on accuracy.
  • The paper has been accepted for presentation at Interspeech 2025, with additional results.

Authors: Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Assaf Arbelle

Accepted for Interspeech 2025 (with additional results)
License: CC BY 4.0

Abstract: Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input respectively. This work aims to create a system that enables user interaction through both speech and images. That is achieved through the fusion of text, speech, and image modalities to tackle the task of spoken VQA (SVQA). The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images. Training and evaluating SVQA models requires a dataset for all three modalities, but no such dataset currently exists. We address this problem by synthesizing VQA datasets using two zero-shot TTS models. Our initial findings indicate that a model trained only with synthesized speech nearly reaches the performance of the upper-bounding model trained on textual QAs. In addition, we show that the choice of the TTS model has a minor impact on accuracy.
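The dataset synthesis step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`image`, `question`, `answer`), the function names, the 16 kHz sample rate, and the `placeholder_tts` stand-in are all assumptions made for this example. A real setup would pass a zero-shot TTS model in place of the placeholder.

```python
import math
import struct
import wave
from pathlib import Path
from typing import Callable, Dict, List

SAMPLE_RATE = 16000  # assumed value, not taken from the paper


def placeholder_tts(text: str) -> bytes:
    """Hypothetical stand-in for a zero-shot TTS model.

    Returns 16-bit mono PCM: a 220 Hz tone lasting 50 ms per word.
    """
    n_samples = (SAMPLE_RATE // 20) * max(len(text.split()), 1)
    return b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE)))
        for i in range(n_samples)
    )


def synthesize_vqa(
    records: List[Dict[str, str]],
    out_dir: str,
    tts: Callable[[str], bytes] = placeholder_tts,
) -> List[Dict[str, str]]:
    """Write one WAV file per question and return records with an 'audio' path added."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    spoken = []
    for idx, rec in enumerate(records):
        wav_path = out_path / f"q{idx:06d}.wav"
        with wave.open(str(wav_path), "wb") as wf:
            wf.setnchannels(1)        # mono
            wf.setsampwidth(2)        # 16-bit samples
            wf.setframerate(SAMPLE_RATE)
            wf.writeframes(tts(rec["question"]))
        # Image and answer are kept unchanged; only the question gains audio.
        spoken.append({**rec, "audio": str(wav_path)})
    return spoken
```

Because the TTS model is passed in as a callable, the same text VQA records could be run through two different TTS models to produce two spoken variants of a dataset, which is the kind of comparison the abstract refers to.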

Submitted to arXiv on 29 May 2025

Results of the summarizing process for the arXiv paper: 2505.23308v1

The paper "Spoken question answering for visual queries" by Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, and Assaf Arbelle from IBM Research in Israel and Tel-Aviv University describes a system that lets users interact through both speech and images. It addresses the task of spoken Visual Question Answering (SVQA) by fusing text, speech, and image modalities into a single multi-modal model that answers spoken questions about images. A key obstacle to training and evaluating SVQA models is that no existing dataset covers all three modalities. To work around this, the authors synthesized VQA datasets using two zero-shot Text-to-Speech (TTS) models. Their initial findings suggest that a model trained only on synthesized speech nearly reaches the performance of the upper-bound model trained on textual QAs, and that the choice of TTS model has only a minor impact on accuracy. The work has been accepted for presentation at Interspeech 2025 with additional results. Overall, it extends QA systems to spoken interaction with visual content.
Created on 01 Jun. 2025
