ScreenAI: A visual language model for UI and visually-situated language understanding

Screen user interfaces (UIs) and infographics, such as charts, diagrams and tables, play important roles in human communication and human-machine interaction as they facilitate rich and interactive user experiences. UIs and infographics share similar design principles and visual language (e.g., icons and layouts), which offers an opportunity to build a single model that can understand, reason, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs present a unique modeling challenge.

To that end, we present “ScreenAI: A Vision-Language Model for UI and Infographics Understanding”. ScreenAI improves upon the PaLI architecture with the flexible patching strategy from pix2struct. We train ScreenAI on a unique mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify UI element information (i.e., type, location and description) on a screen. These text annotations provide large language models (LLMs) with screen descriptions, enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. At only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks (WebSRC and MoTIF), and best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared to models of similar size. We are also releasing three new datasets: Screen Annotation to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability.


ScreenAI

ScreenAI’s architecture is based on PaLI, composed of a multimodal encoder block and an autoregressive decoder. The PaLI encoder uses a vision transformer (ViT) that creates image embeddings and a multimodal encoder that takes the concatenation of the image and text embeddings as input. This flexible architecture allows ScreenAI to solve vision tasks that can be recast as text+image-to-text problems.
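To make the text+image-to-text framing concrete, here is a minimal PyTorch-style sketch of such an encoder-decoder: a ViT stand-in embeds image patches, a multimodal encoder consumes the concatenated image and text embeddings, and an autoregressive decoder produces output tokens. All class names, layer counts and dimensions below are illustrative assumptions, not ScreenAI's actual configuration.

```python
# Minimal sketch of a PaLI-style encoder-decoder; sizes are illustrative only.
import torch
import torch.nn as nn

class PaLIStyleModel(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        # ViT stand-in: projects flattened 16x16x3 image patches to embeddings.
        self.patch_proj = nn.Linear(16 * 16 * 3, d_model)
        self.vit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        # Text embedding shared by the encoder input and the decoder.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Multimodal encoder consumes concatenated image + text embeddings.
        self.mm_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        # Autoregressive decoder cross-attends to the multimodal encoding.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, input_ids, target_ids):
        img = self.vit(self.patch_proj(patches))           # (B, P, D) image embeddings
        txt = self.text_embed(input_ids)                   # (B, T, D) text embeddings
        memory = self.mm_encoder(torch.cat([img, txt], dim=1))
        dec = self.decoder(self.text_embed(target_ids), memory)
        return self.lm_head(dec)                           # next-token logits
```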

On top of the PaLI architecture, we employ a flexible patching strategy introduced in pix2struct. Instead of using a fixed-grid pattern, the grid dimensions are selected such that they preserve the native aspect ratio of the input image. This enables ScreenAI to work well across images of various aspect ratios.
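As a rough illustration of aspect-ratio-preserving patching, the sketch below picks a rows × cols patch grid that approximates the image's aspect ratio while staying within a fixed patch budget. The budget value and the rounding scheme are assumptions for illustration, not the pix2struct implementation.

```python
import math

def aspect_preserving_grid(img_h, img_w, max_patches=1024):
    """Choose a rows x cols grid with rows/cols ~= img_h/img_w and rows*cols <= max_patches."""
    rows = max(1, math.floor(math.sqrt(max_patches * img_h / img_w)))
    cols = max(1, math.floor(max_patches / rows))
    return rows, cols

# A tall phone screenshot gets proportionally more rows than columns:
print(aspect_preserving_grid(2400, 1080))   # -> (47, 21), ratio ~2.2 like the image
```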

The ScreenAI model is trained in two stages: a pre-training stage followed by a fine-tuning stage. First, self-supervised learning is applied to automatically generate data labels, which are then used to train ViT and the language model. ViT is frozen during the fine-tuning stage, where most data used is manually labeled by human raters.
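A minimal sketch of this two-stage setup, reusing the hypothetical model class from the architecture sketch above: every parameter is trainable during pre-training, and the ViT weights are frozen before fine-tuning on human-labeled data.

```python
# Illustrative only: freeze the ViT for fine-tuning, train everything else.
model = PaLIStyleModel()   # hypothetical class from the architecture sketch above

# Stage 1: pre-training on automatically labeled data, all parameters trainable.
pretrain_params = [p for p in model.parameters() if p.requires_grad]

# Stage 2: fine-tuning on human-labeled data with the ViT frozen.
for p in model.vit.parameters():
    p.requires_grad = False
finetune_params = [p for p in model.parameters() if p.requires_grad]
```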

ScreenAI model architecture.


Data generation

To create a pre-training dataset for ScreenAI, we first compile an extensive collection of screenshots from various devices, including desktops, mobile devices, and tablets. This is achieved by using publicly accessible web pages and following the programmatic exploration approach used for the RICO dataset for mobile apps. We then apply a layout annotator, based on the DETR model, that identifies and labels a wide range of UI elements (e.g., image, pictogram, button, text) and their spatial relationships. Pictograms undergo further analysis using an icon classifier capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle information conveyed through icons. For icons that are not covered by the classifier, and for infographics and images, we use the PaLI image captioning model to generate descriptive captions that provide contextual information. We also apply an optical character recognition (OCR) engine to extract and annotate textual content on screen. We combine the OCR text with the previous annotations to create a detailed description of each screen.
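The sketch below shows one way the outputs of these annotators could be merged into a per-screen record. The field names and element types are assumptions for illustration and not the actual ScreenAI screen schema.

```python
# Hypothetical assembly of one screen's annotation record from the component annotators.
def build_screen_schema(layout_boxes, icon_labels, captions, ocr_results):
    elements = []
    for box in layout_boxes:                                     # DETR-based layout annotator
        element = {"type": box["type"], "bbox": box["bbox"]}
        if box["type"] == "PICTOGRAM":
            element["icon_class"] = icon_labels.get(box["id"])   # 77-way icon classifier
        elif box["type"] == "IMAGE":
            element["caption"] = captions.get(box["id"])         # PaLI image captioning
        elif box["type"] == "TEXT":
            element["text"] = ocr_results.get(box["id"])         # OCR engine output
        elements.append(element)
    return {"elements": elements}
```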

A mobile app screenshot with generated annotations that include UI elements and their descriptions, e.g., TEXT elements also contain the text content from OCR, IMAGE elements contain image captions, LIST_ITEMs contain all their child elements.


LLM-based data generation

We enhance the pre-training data's diversity using PaLM 2 to generate input-output pairs in a two-step process. First, screen annotations are generated using the technique outlined above, then we craft a prompt around this schema for the LLM to create synthetic data. This process requires prompt engineering and iterative refinement to find an effective prompt. We assess the generated data's quality through human validation against a quality threshold.


You only speak JSON. Do not write text that isn’t JSON. You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows: questions: [ {{question: the question, answer: the answer }}, ... ] {THE SCREEN SCHEMA}
A sample prompt for QA data generation.
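A hypothetical sketch of the surrounding generation loop: the screen schema is serialized into a prompt like the one above, the LLM's reply is parsed as JSON, and malformed generations are discarded. The `call_llm` function and the template wording are placeholders, not a real API or the exact production prompt.

```python
import json

# Assumed prompt template; {{ }} escapes literal braces, {schema} is filled in below.
PROMPT_TEMPLATE = (
    "You only speak JSON. Do not write text that isn't JSON. "
    "Given the following mobile screenshot, described in words, generate 5 questions "
    "about its content with short answers, structured as "
    '{{"questions": [{{"question": ..., "answer": ...}}]}}\n{schema}'
)

def generate_qa_examples(screen_schema, call_llm):
    prompt = PROMPT_TEMPLATE.format(schema=json.dumps(screen_schema))
    raw = call_llm(prompt)                  # placeholder for the actual LLM interface
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []                           # discard malformed generations
    return [(q["question"], q["answer"]) for q in parsed.get("questions", [])]
```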

By combining the natural language capabilities of LLMs with a structured schema, we simulate a wide range of user interactions and scenarios to generate synthetic, realistic tasks. In particular, we generate three categories of tasks:

  • Question answering: The model is asked to answer questions regarding the content of the screenshots, e.g., “When does the restaurant open?”
  • Screen navigation: The model is asked to convert a natural language utterance into an executable action on a screen, e.g., “Click the search button.”
  • Screen summarization: The model is asked to summarize the screen content in one or two sentences.
Block diagram of our workflow for generating data for QA, summarization and navigation tasks using existing ScreenAI models and LLMs. Each task uses a custom prompt to emphasize desired aspects, like questions related to counting, involving reasoning, etc.

LLM-generated data. Examples for screen QA, navigation and summarization. For navigation, the action bounding box is displayed in red on the screenshot.
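For illustration, a generated screen-navigation sample could be serialized as an input/target text pair such as the one below; the action string format and the coordinates are assumptions, since the exact serialization is not reproduced in this post.

```python
# Hypothetical navigation sample: the target names an action and the bounding
# box of the UI element to act on (format and coordinates are assumed).
navigation_example = {
    "input": "Click the search button. <screen schema goes here>",
    "target": "CLICK <bbox 840 48 920 128>",
}
```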


Experiments and results

As previously mentioned, ScreenAI is trained in two stages: pre-training and fine-tuning. Pre-training data labels are obtained using self-supervised learning, and fine-tuning data labels come from human raters.

We fine-tune ScreenAI using public QA, summarization, and navigation datasets and a variety of tasks related to UIs. For QA, we use well-established benchmarks in the multimodal and document understanding field, such as ChartQA, DocVQA, Multipage DocVQA, InfographicVQA, OCR-VQA, WebSRC and ScreenQA. For navigation, datasets used include Referring Expressions, MoTIF, Mug, and Android in the Wild. Finally, we use Screen2Words for screen summarization and Widget Captioning for describing specific UI elements. Along with the fine-tuning datasets, we evaluate the fine-tuned ScreenAI model using three novel benchmarks:

  1. Screen Annotation: Enables the evaluation of the model's layout annotations and spatial understanding capabilities.
  2. ScreenQA Short: A variation of ScreenQA, where its ground truth answers have been shortened to contain only the relevant information that better aligns with other QA tasks.
  3. Complex ScreenQA: Complements ScreenQA Short with more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios.

The fine-tuned ScreenAI model achieves state-of-the-art results on various UI- and infographic-based tasks (WebSRC and MoTIF) and best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared to models of similar size. ScreenAI achieves competitive performance on Screen2Words and OCR-VQA. Additionally, we report results on the new benchmark datasets introduced to serve as a baseline for further research.

Comparing model performance of ScreenAI with state-of-the-art (SOTA) models of similar size.

Next, we analyze ScreenAI’s scaling capabilities and observe that across all tasks, increasing the model size improves performance, and the improvements have not saturated at the largest size.

Model performance increases with size, and the performance has not saturated even at the largest size of 5B parameters.


Conclusion

We introduce the ScreenAI model along with a unified representation that enables us to develop self-supervised learning tasks leveraging data from all these domains. We also illustrate the impact of data generation using LLMs and investigate improving model performance on specific aspects by modifying the training mixture. We apply all of these techniques to build multi-task trained models that perform competitively with state-of-the-art approaches on a number of public benchmarks. However, we also note that our approach still lags behind large models, and further research is needed to bridge this gap.


Acknowledgements

This project is the result of joint work with Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen and Abhanshu Sharma. We thank Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut, and Tania Bedrax-Weiss for their insightful feedback and discussions, along with Rahul Aralikatte, Hao Cheng and Daniel Kim for their support in data preparation. We also thank Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their leadership, vision and support. We are very grateful to Tom Small for helping us create the animation in this post.
