OmniParser for Pure Vision Based GUI Agent

AI-generated keywords: OMNIPARSER

AI-generated Key Points

  • OMNIPARSER is a vision-only approach designed to parse UI screenshots into structured elements
  • Consists of two fine-tuned models: an icon detection model and a functional description model
  • Both models are trained on curated datasets: one for interactable region detection and one for icon functional description
  • Parsed results from OMNIPARSER improve GPT-4V's performance on the ScreenSpot benchmark
  • Outperforms GPT-4V agents relying on HTML-extracted information on Mind2Web benchmark and surpasses GPT-4V augmented with specialized Android icon detection model on AITW benchmark
  • Goal is to provide a versatile tool for parsing UI screens across PC and mobile platforms without additional information like HTML or view hierarchy in Android
  • Addresses limitations in creating widely usable agents across multiple platforms and applications
  • Vision-based screen parsing technique bridges understanding of basic UI elements and grounding actions in various operating systems and applications
  • Reliable vision-based screen parsing method crucial for enhancing robustness of agentic workflows in diverse user tasks
  • OMNIPARSER extracts information from UI screenshots, providing structured bounding boxes and labels to improve GPT-4V's performance in action prediction across different user tasks

Authors: Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah

License: CC BY 4.0

Abstract: The recent success of large vision language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were utilized to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms the GPT-4V baselines that require additional information beyond the screenshot.

Submitted to arXiv on 01 Aug. 2024


Results of the summarizing process for the arXiv paper: 2408.00203v1

In this report, we introduce OMNIPARSER, a comprehensive vision-only approach designed to parse user interface (UI) screenshots into structured elements. The system consists of two fine-tuned models: an icon detection model and a functional description model. To train these models, we curated an interactable region detection dataset using popular webpages and an icon functional description dataset. By leveraging the parsed results from OMNIPARSER, the performance of GPT-4V is significantly enhanced on ScreenSpot benchmarks. Notably, OMNIPARSER outperforms GPT-4V agents that rely on HTML-extracted information on the Mind2Web benchmark and surpasses GPT-4V augmented with a specialized Android icon detection model on the AITW benchmark. Our goal with OMNIPARSER is to provide a versatile and user-friendly tool capable of parsing UI screens across both PC and mobile platforms without the need for additional information such as HTML or view hierarchy in Android. This approach addresses the current limitations in creating widely usable agents that can operate seamlessly across multiple platforms and applications. While previous works have focused on specific applications or platforms, our vision-based screen parsing technique aims to bridge the gap between understanding basic UI elements and accurately grounding actions in various operating systems and applications. We argue that existing pure vision-based screen parsing techniques are inadequate, leading to an underestimation of the capabilities of models like GPT-4V. A reliable vision-based screen parsing method is crucial for enhancing the robustness of agentic workflows in diverse user tasks. OMNIPARSER serves as a general solution for extracting information from UI screenshots, providing structured bounding boxes and labels that improve GPT-4V's performance in action prediction across different user tasks.
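To make "structured bounding boxes and labels" concrete, here is a minimal Python sketch of how the outputs of a detection model and a description model might be merged into a list of elements and serialized as text for a multimodal model. The names `ParsedElement`, `merge_outputs`, `to_prompt`, and `center`, as well as the prompt layout, are hypothetical and chosen for illustration; the summary does not specify the actual data format used by OMNIPARSER.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container for one parsed UI element (not the paper's format).
@dataclass
class ParsedElement:
    element_id: int
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    description: str                 # functional description of the element


def merge_outputs(
    boxes: List[Tuple[int, int, int, int]], descriptions: List[str]
) -> List[ParsedElement]:
    """Pair each detected box with its description, assigning numeric IDs."""
    if len(boxes) != len(descriptions):
        raise ValueError("each box needs exactly one description")
    return [
        ParsedElement(i, box, text)
        for i, (box, text) in enumerate(zip(boxes, descriptions))
    ]


def to_prompt(elements: List[ParsedElement]) -> str:
    """Serialize elements as one line each, suitable for a text prompt."""
    return "\n".join(
        f"ID {e.element_id}: {e.description} at {e.bbox}" for e in elements
    )


def center(element: ParsedElement) -> Tuple[float, float]:
    """Return the click point for an element: the center of its box."""
    x0, y0, x1, y1 = element.bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)
```

Under these assumptions, an agent would ask the model to answer with an element ID, then translate that ID back into screen coordinates with `center`, which is one simple way to ground a predicted action in a region of the screenshot.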
Overall, our contributions include curating an interactable region detection dataset based on bounding boxes extracted from DOM trees of popular webpages, developing specialized models for icon detection and functional description, and demonstrating significant performance improvements on benchmark tests. We envision OMNIPARSER as a valuable tool for advancing research in agent systems operating on UI interfaces and facilitating seamless interactions across multiple platforms and applications.
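As a rough illustration of deriving detection labels from a DOM tree, the sketch below filters a flat list of DOM nodes down to interactable ones and clips their rectangles to the viewport. The node dictionary layout, the `INTERACTABLE_TAGS` set, and the function name `extract_interactable_boxes` are all assumptions made for this example; the summary does not describe how the authors actually selected or filtered elements.

```python
from typing import Dict, List, Tuple

# Assumption: tags treated as interactable. The real selection rule is not
# described in the summary and may differ.
INTERACTABLE_TAGS = {"a", "button", "input", "select", "textarea"}

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def extract_interactable_boxes(
    nodes: List[Dict], viewport_w: int, viewport_h: int
) -> List[Box]:
    """Return viewport-clipped boxes for interactable DOM nodes.

    Each node is assumed to look like {"tag": str, "rect": (x, y, w, h)}.
    """
    boxes: List[Box] = []
    for node in nodes:
        if node["tag"].lower() not in INTERACTABLE_TAGS:
            continue
        x, y, w, h = node["rect"]
        # Clip the rectangle to the visible screenshot area.
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(viewport_w, x + w), min(viewport_h, y + h)
        # Skip elements that are entirely off-screen or have no area.
        if x1 <= x0 or y1 <= y0:
            continue
        boxes.append((x0, y0, x1, y1))
    return boxes
```

Boxes produced this way could be paired with the corresponding screenshot to form training examples for a detection model.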
Created on 17 Feb. 2025

