DARE: Diverse Visual Question Answering with Robustness Evaluation

AI-generated keywords: DARE

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors explore capabilities and limitations of Vision Language Models (VLMs)
  • VLMs excel in standard image classification and image-text matching tasks
  • VLMs still struggle with crucial vision-language reasoning abilities such as counting and spatial reasoning
  • Existing benchmarks do not adequately assess the robustness of VLMs
  • Introduction of DARE, a multiple-choice Visual Question Answering (VQA) benchmark to evaluate VLM performance across five diverse categories
  • State-of-the-art VLMs struggle with questions across most categories and fail to consistently deliver peak performance under different robustness evaluations
  • Open-source VLMs exhibit lower robustness compared to closed-source models
  • Need for improved evaluation metrics to accurately assess the robustness of VLMs

Authors: Hannah Sterz, Jonas Pfeiffer, Ivan Vulić

Abstract: Vision Language Models (VLMs) extend remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on the variations of: prompts, the subsets of answer options, the output format and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst case performance across the subsets of options is up to 34% below the performance in the standard case. The robustness of the open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match the closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.

Submitted to arXiv on 26 Sep. 2024

Results of the summarizing process for the arXiv paper: 2409.18023v1

This paper's license doesn't allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

In their paper titled "DARE: Diverse Visual Question Answering with Robustness Evaluation," authors Hannah Sterz, Jonas Pfeiffer, and Ivan Vulić explore the capabilities and limitations of Vision Language Models (VLMs). These models extend the capabilities of text-only large language models and vision-only models to process multi-modal vision-text input. While VLMs excel in standard image classification and image-text matching tasks, they still struggle with crucial vision-language reasoning abilities such as counting and spatial reasoning. The authors also point out that VLMs can be brittle to small variations in instructions and evaluation protocols, yet existing benchmarks do not assess this robustness.

To address this gap, the authors introduce DARE, a carefully curated multiple-choice Visual Question Answering (VQA) benchmark designed to evaluate VLM performance across five diverse categories. DARE includes four robustness-oriented evaluations that vary the prompts, the subsets of answer options, the output format, and the number of correct answers.

Through their experiments, the authors find that state-of-the-art VLMs struggle with questions in most categories and fail to consistently deliver their peak performance across the tested robustness evaluations. The worst-case performance across subsets of answer options is up to 34% below the performance in the standard case. Furthermore, open-source VLMs such as LLaVA 1.6 and Idefics2 are less robust than closed-source models such as GPT-4 and Gemini, although even the closed-source models remain very brittle to these variations. Overall, the findings underscore the need for improved evaluation metrics that accurately assess the robustness of VLMs on challenging vision-language reasoning tasks.
Created on 27 Sep. 2024
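
To make the idea of evaluating over subsets of answer options more concrete, here is a minimal Python sketch. It is not the authors' code: the model interface, the helper names (`option_subsets`, `evaluate`, `second_option_model`), the toy questions, and the definition of "worst case" used here (a question counts only if the model is right on every tested subset) are all assumptions made for illustration. The image input is omitted for brevity.

```python
from itertools import combinations
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical model interface (an assumption, not from the paper):
# takes a question and its answer options, returns the chosen option.
Model = Callable[[str, Sequence[str]], str]
Example = Tuple[str, List[str], str]  # (question, options, correct option)


def option_subsets(options: Sequence[str], correct: str, size: int) -> Iterator[List[str]]:
    """Yield every subset of `size` options that still contains the correct one.

    The original order of the options is preserved within each subset.
    """
    distractors = [o for o in options if o != correct]
    for kept in combinations(distractors, size - 1):
        yield [o for o in options if o == correct or o in kept]


def evaluate(model: Model, dataset: Sequence[Example], subset_size: int) -> Tuple[float, float]:
    """Return (standard accuracy, worst-case accuracy) over `dataset`.

    Standard: the model sees all options once.
    Worst case (assumed definition): the model must answer correctly
    for every subset produced by `option_subsets`.
    """
    standard_hits = 0
    worst_case_hits = 0
    for question, options, correct in dataset:
        standard_hits += model(question, options) == correct
        worst_case_hits += all(
            model(question, subset) == correct
            for subset in option_subsets(options, correct, subset_size)
        )
    n = len(dataset)
    return standard_hits / n, worst_case_hits / n


def second_option_model(question: str, options: Sequence[str]) -> str:
    # Toy stand-in for a VLM with a position bias: always picks the second option.
    return options[1]


# Toy questions invented for this sketch.
toy_dataset: List[Example] = [
    ("How many cats are in the image?", ["one", "two", "three", "four"], "two"),
    ("Where is the cup relative to the plate?", ["left", "right", "above", "below"], "left"),
]

standard, worst = evaluate(second_option_model, toy_dataset, subset_size=3)
print(standard, worst)  # prints: 0.5 0.0
```

In this toy run the position-biased model gets the first question right when all four options are shown, but fails once a subset shifts the correct answer out of the second slot, so its worst-case accuracy drops to zero. This is the kind of gap between standard and worst-case performance that a robustness evaluation is meant to expose.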
