Spoken question answering for visual queries

AI-generated keywords: Spoken Question Answering, Visual Queries, Multi-Modal Model, Text-to-Speech, Interspeech 2025

AI-generated Key Points

The paper discusses the development of a system for spoken Visual Question Answering (SVQA) that integrates speech and image modalities.
The authors address the challenge of training and evaluating SVQA models due to the lack of datasets encompassing text, speech, and images.
They synthesized VQA datasets using two zero-shot Text-to-Speech (TTS) models to overcome this issue.
Initial findings indicate that a model trained only on synthesized speech nearly reaches the performance of the upper-bound model trained on textual QAs.
The choice of TTS model was found to have only a minor impact on accuracy.
The paper has been accepted for Interspeech 2025, where it appears with additional results.

Authors: Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Assaf Arbelle

arXiv: 2505.23308v1 (eess.AS)

Accepted for Interspeech 2025 (with additional results)

License: CC BY 4.0

Abstract: Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input respectively. This work aims to create a system that enables user interaction through both speech and images. That is achieved through the fusion of text, speech, and image modalities to tackle the task of spoken VQA (SVQA). The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images. Training and evaluating SVQA models requires a dataset for all three modalities, but no such dataset currently exists. We address this problem by synthesizing VQA datasets using two zero-shot TTS models. Our initial findings indicate that a model trained only with synthesized speech nearly reaches the performance of the upper-bounding model trained on textual QAs. In addition, we show that the choice of the TTS model has a minor impact on accuracy.

Submitted to arXiv on 29 May 2025

Results of the summarizing process for the arXiv paper: 2505.23308v1

Comprehensive Summary

The paper "Spoken question answering for visual queries" by Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, and Assaf Arbelle from IBM Research in Israel and Tel-Aviv University addresses the development of a system that enables user interaction through both speech and images. This system tackles the task of spoken Visual Question Answering (SVQA) by fusing text, speech, and image modalities to create a multi-modal model capable of answering spoken questions about images. One of the key challenges in training and evaluating SVQA models is the lack of datasets that encompass all three modalities. To address this issue, the authors synthesized VQA datasets using two zero-shot Text-to-Speech (TTS) models. Their initial findings suggest that a model trained solely with synthesized speech can achieve performance levels close to those of models trained on textual QAs. Furthermore, they demonstrate that the choice of TTS model has only a minor impact on accuracy. The authors also mention that their work has been accepted for presentation at Interspeech 2025 with additional results. Overall, this research contributes to advancing the field of QA systems by expanding capabilities to include spoken interactions with visual content.

Layman's Summary

- The paper is about building a system that can answer spoken questions about pictures.
- Such systems are hard to train and test because no existing dataset combines text, speech, and images.
- To get around this, the authors used two text-to-speech models to turn the written questions in existing picture question-answer datasets into spoken ones.
- A model trained only on this computer-generated speech did almost as well as one trained on written questions.
- Which of the two text-to-speech models was used made little difference to accuracy.

Definitions

- Spoken Visual Question Answering (SVQA): Answering a spoken question about an image.
- Modality: A type of input or channel of information, such as text, speech, or images.
- Zero-shot Text-to-Speech (TTS) models: Models that turn written text into speech and can imitate a voice they were not trained on, typically from a short audio sample of that voice.

The Development of a Spoken Question Answering System for Visual Queries

Images and videos are now a routine way to convey information, from social media platforms to e-commerce websites. Interacting with this visual content through typed queries alone can be limiting, which is where spoken visual question answering (SVQA) systems come into play. A team of researchers from IBM Research in Israel and Tel-Aviv University has recently published a paper titled "Spoken question answering for visual queries" that addresses the development of a system capable of answering spoken questions about images. The paper, authored by Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, and Assaf Arbelle, presents their work on a multi-modal model that fuses text, speech, and image modalities to enable user interaction through both speech and images.

The Challenge: Lack of Multi-Modal Datasets

One of the key challenges in training and evaluating SVQA models is the lack of datasets that cover all three modalities: text, speech, and images. To address this, the authors synthesized spoken versions of VQA datasets using two zero-shot Text-to-Speech (TTS) models, so that the written questions also exist in spoken form. The authors also note that the work has been accepted for Interspeech 2025 with additional results.
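The synthesis step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the row fields (`image_id`, `question`, `answer`), the `SpokenVQAExample` record, the 16 kHz sample rate, and the `placeholder_tts` function are all hypothetical. `placeholder_tts` emits a plain tone instead of speech so that the sketch runs without any TTS library; in practice each entry in `tts_models` would wrap a real zero-shot TTS model.

```python
import io
import math
import struct
import wave
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping

SAMPLE_RATE = 16_000  # assumed value; the summary does not state a sample rate


@dataclass
class SpokenVQAExample:
    """One text + speech + image record (hypothetical schema)."""
    image_id: str
    question_text: str
    question_wav: bytes  # synthesized spoken question as WAV bytes
    answer: str
    tts_name: str        # which TTS model produced the audio


def placeholder_tts(text: str) -> bytes:
    """Stand-in for a real zero-shot TTS model (assumption).

    Produces a 220 Hz tone whose length grows with the word count,
    only so the pipeline can run end to end.
    """
    n_samples = int(SAMPLE_RATE * 0.05 * max(len(text.split()), 1))
    frames = b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE)))
        for i in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)
    return buf.getvalue()


def synthesize_dataset(
    vqa_rows: List[Mapping[str, str]],
    tts_models: Dict[str, Callable[[str], bytes]],
) -> List[SpokenVQAExample]:
    """Create one spoken example per (VQA row, TTS model) pair."""
    examples = []
    for row in vqa_rows:
        for name, tts in tts_models.items():
            examples.append(
                SpokenVQAExample(
                    image_id=row["image_id"],
                    question_text=row["question"],
                    question_wav=tts(row["question"]),
                    answer=row["answer"],
                    tts_name=name,
                )
            )
    return examples
```

Keeping the TTS model name on every record makes it straightforward to train or evaluate on the output of one TTS model at a time, which is what a comparison between TTS models requires.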

Results: Synthesized Speech Achieves High Performance Levels

The initial findings indicate that a model trained only on synthesized speech nearly reaches the performance of the upper-bound model trained on textual QAs. This matters because it suggests that SVQA systems can be trained without first collecting a large corpus of recorded spoken questions. The authors also show that the choice of TTS model has only a minor impact on accuracy, at least for the two TTS models they compared.
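To make the comparison concrete, the kind of bookkeeping behind such a result might look like the sketch below. It assumes simple exact-match accuracy after lowercasing and whitespace normalization; the summary does not say which accuracy metric the paper uses, and the function names are hypothetical. Any predictions fed to it in an example would be toy values, not results from the paper.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match the reference answer exactly
    after normalization (an assumed metric, not necessarily the paper's)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def gap_to_upper_bound(upper_bound_acc: float, model_acc: float) -> float:
    """How far a speech-trained model falls below the text-trained upper bound."""
    return upper_bound_acc - model_acc
```

Computing `gap_to_upper_bound` once per TTS model, against the same text-trained reference, gives one number for how close speech training gets to the upper bound and another for how much the TTS choice matters.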

Implications and Future Work

A multi-modal model that can answer spoken questions about images has two main implications for QA systems. First, it extends QA to spoken interaction with visual content, which makes such systems more accessible and user-friendly. Second, because the training speech is synthesized rather than recorded, the approach reduces the dependence on large recorded speech datasets and makes these systems easier to scale. As for future work, one possible direction is to improve the model's language understanding and response generation. Another is to evaluate the system on recorded human speech from diverse speakers and accents, since the results reported so far rely on synthesized speech.

Conclusion

In conclusion, "Spoken question answering for visual queries" by Nimrod Shabtay et al. presents an approach to building a multi-modal model that answers spoken questions about images. Training on synthesized speech reduces the dependence on recorded spoken datasets while nearly matching the accuracy of a model trained on text. With further work, this research could lead to more user-friendly QA systems that support both text-based and spoken interaction with visual content.

Created on 01 Jun. 2025
