Lexi: Self-Supervised Learning of the UI Language

AI-generated keywords: Visio-linguistic; UI Representation; Lexi; UICaption; Transformer-based

AI-generated Key Points

  • A need for generic visio-linguistic representations of UI screens and their components
  • Existing UI representation models rely on UI metadata which may not always be available or accessible
  • A new approach has been proposed to leverage data from instruction manuals and how-to guides to learn these representations
  • The approach involves the use of a pre-trained vision and language model called Lexi, designed to handle the unique features of UI screens
  • To train Lexi, a new dataset called UICaption was curated, consisting of 114k UI images paired with descriptions of their functionality
  • Lexi was evaluated on four tasks and outperformed existing models on all of them
  • Visio-linguistic (VL) models use high-quality paired datasets such as Conceptual Captions, SBU Captions, and MS COCO
  • Future work should focus on expanding this method to different languages beyond English instructions and evaluating its performance on longer instructions
  • Pre-training the model requires significant GPU resources, and the model is difficult to run on edge devices such as mobile phones, which raises privacy concerns

Authors: Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva

EMNLP (Findings) 2022
License: CC BY 4.0

Abstract: Humans can learn to operate the user interface (UI) of an application by reading an instruction manual or how-to guide. Along with text, these resources include visual content such as UI screenshots and images of application icons referenced in the text. We explore how to leverage this data to learn generic visio-linguistic representations of UI screens and their components. These representations are useful in many real applications, such as accessibility, voice navigation, and task automation. Prior UI representation models rely on UI metadata (UI trees and accessibility labels), which is often missing, incompletely defined, or not accessible. We avoid such a dependency, and propose Lexi, a pre-trained vision and language model designed to handle the unique features of UI screens, including their text richness and context sensitivity. To train Lexi we curate the UICaption dataset consisting of 114k UI images paired with descriptions of their functionality. We evaluate Lexi on four tasks: UI action entailment, instruction-based UI image retrieval, grounding referring expressions, and UI entity recognition.

Submitted to arXiv on 23 Jan. 2023

Results of the summarizing process for the arXiv paper: 2301.10165v1

In the field of user interface (UI) language understanding, there is a need for generic visio-linguistic representations of UI screens and their components that can be used in applications such as accessibility, voice navigation, and task automation. Existing UI representation models rely on UI metadata, which may not always be available or accessible. A new approach instead leverages data from instruction manuals and how-to guides to learn these representations. It uses a pre-trained vision and language model called Lexi, which is designed to handle the unique features of UI screens, such as their text richness and context sensitivity.

To train Lexi, a new dataset called UICaption was curated, consisting of 114k UI images paired with descriptions of their functionality. Lexi was evaluated on four tasks: UI action entailment, instruction-based UI image retrieval, grounding referring expressions, and UI entity recognition. The results showed that Lexi outperformed existing models on all four tasks.

Related work in this area includes multiple transformer-based architectures proposed to learn a single feature space from visual and language inputs. Language models can be pre-trained on practically unlimited natural language text such as Wikipedia, whereas visio-linguistic (VL) models depend on high-quality paired datasets such as Conceptual Captions, SBU Captions, and MS COCO. The new approach extends prior architectures with a vision encoder designed to leverage the text richness of UI images.

Although instruction manuals are useful resources for learning how to operate an application's UI, they may not always be available or complete for certain application categories, such as shopping or transportation. Curating pre-training data from the web also makes it challenging to ensure zero overlap between pre-training data and downstream evaluation tasks.

Future work should focus on extending this method to languages other than English and evaluating its performance on longer instructions. Pre-training the model requires significant GPU resources, and the model is difficult to run on edge devices such as mobile phones, which raises privacy concerns. Running and updating the models locally would nonetheless be highly desirable for the scenarios motivating this work, such as accessibility support and voice navigation in mobile apps. Overall, this new approach offers a promising solution for learning generic visio-linguistic representations of UI screens without relying on metadata, while also avoiding overlap between pre-training data sources and downstream evaluation tasks.
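
The overlap concern between pre-training data and downstream evaluation tasks can be made concrete with a small sketch. The summary does not say how the authors checked for overlap, so the following is only an illustration under stated assumptions: images are available as raw bytes, and "overlap" means byte-identical files. The names `fingerprint` and `drop_overlap` and the toy data are hypothetical.

```python
import hashlib


def fingerprint(image_bytes: bytes) -> str:
    """Return a SHA-256 hex digest identifying byte-identical images."""
    return hashlib.sha256(image_bytes).hexdigest()


def drop_overlap(pretrain, eval_images):
    """Keep only (image_bytes, caption) pairs whose image is absent from eval_images.

    Assumption: exact byte equality defines a duplicate. Resized or
    re-encoded copies would not be caught by this check.
    """
    eval_hashes = {fingerprint(img) for img in eval_images}
    return [(img, cap) for img, cap in pretrain
            if fingerprint(img) not in eval_hashes]


# Toy stand-ins for image files (hypothetical data, not from UICaption).
pretrain = [(b"screen-a", "Tap Settings"), (b"screen-b", "Open the menu")]
eval_images = [b"screen-b"]

filtered = drop_overlap(pretrain, eval_images)
print(len(filtered))
```

An exact hash only removes identical files; catching near-duplicates such as rescaled screenshots would need a different technique (for example perceptual hashing), which this sketch does not attempt.
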
Created on 18 May. 2023

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.