MMSkills: Towards Multimodal Skills for General Visual Agents

AI-generated keywords: Reusable Skills

AI-generated Key Points

Reusable skills are crucial for improving agent capabilities
MMSkills framework introduced to represent, generate, and use reusable multimodal procedures for runtime visual decision making
Each MMSkill combines textual procedure with runtime state cards and multi-view keyframes
Experiments show that MMSkills enhance performance of both frontier and smaller multimodal agents in visual decision-making tasks

Authors: Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu

arXiv: 2605.13527v1 (cs.AI)

25 pages, 8 figures, 8 tables. Project page: https://zkangning.github.io/towards_mmskills

License: CC BY 4.0

Abstract: Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.

Submitted to arXiv on 13 May 2026

Results of the summarizing process for the arXiv paper: 2605.13527v1

In the realm of improving agent capabilities, reusable skills have emerged as a fundamental component. However, most existing skill packages encode reusable behavior as textual prompts, executable code, or learned routines. For visual agents, procedural knowledge is inherently multimodal: reuse depends not only on which operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. To address this need for multimodal procedural knowledge, we introduce MMSkills, a framework designed to represent, generate, and use reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. Our experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, which suggests that external multimodal procedural knowledge complements model-internal priors. Overall, the work emphasizes the value of incorporating multimodal procedural knowledge into visual agents.

Summary
1. Skills that can be used again and again are very important for making agents better.
2. A new framework called MMSkills helps to create and use these reusable skills for making decisions, while the agent is running, based on what is seen.
3. Each MMSkill mixes written steps with cards showing the current situation and different views of key moments.
4. Tests show that using MMSkills makes both big and small agents do better in tasks where they need to make decisions based on what they see.

Definitions
- Reusable: Something that can be used more than once.
- Agent: A computer program or robot that can do tasks on its own.
- Framework: A structure or plan for doing something.
- Multimodal: Using more than one way, like text and pictures, to understand things.
- Procedures: Steps or actions to follow in a certain order.
- Runtime: The time while a program is running.
- Visual decision-making: Making choices based on what you see rather than just words or numbers.
- Experiments: Tests or trials done to learn something new.

Introduction

In recent years, there has been growing interest in intelligent agents that perform complex tasks by drawing on reusable skills. Such skills let an agent carry procedures from one task to the next instead of rediscovering them each time. However, most existing skill packages encode behavior as textual prompts, executable code, or learned routines, which limits how well they serve agents that must act on what they see. To address this limitation, the authors introduce MMSkills, a framework designed to represent, generate, and use reusable multimodal procedures for runtime visual decision making. The paper explores how external multimodal procedural knowledge can be added to an agent and reports that MMSkills improve performance on visual decision-making tasks.

The Need for Multimodal Procedural Knowledge

Most existing skill packages encode reusable behavior as textual prompts, executable code, or learned routines. For visual agents this is not enough, because procedural knowledge is inherently multimodal: reusing a procedure depends not only on knowing which operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. For example, imagine an agent playing a video game in which it must navigate a maze while avoiding obstacles and collecting coins. A text-only instruction such as "collect the coins" does not tell the agent what a coin or an obstacle looks like on screen, or how to tell that a move succeeded. A skill package that omits this visual side leaves the agent to work it out on its own.

The MMSkills Framework

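To overcome this limitation, the researchers propose MMSkills, a framework in which each skill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. The textual procedure provides high-level instructions on how to perform the task, the state cards capture the runtime states the agent needs to recognize, and the multi-view keyframes provide visual references from different perspectives. The packages are built by an agentic trajectory-to-skill Generator that turns public non-evaluation trajectories into skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. At inference time, a branch-loaded multimodal skill agent inspects selected state cards and keyframes in a temporary branch, aligns them with the live environment, and distills them into structured guidance for the main agent, which avoids excessive image context and over-anchoring to reference screenshots.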
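
The package layout described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the authors' implementation: the class names, field names, and the `select_state_cards` and `distill_guidance` helpers are all hypothetical, and the matching step substitutes simple text-cue overlap for the multimodal alignment that the paper performs in a temporary branch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Keyframe:
    """One reference view of a state (hypothetical fields)."""
    view: str        # e.g. "full_screen" or "cropped_widget"
    image_path: str  # path to a reference screenshot


@dataclass
class StateCard:
    """A recognizable runtime state and what to do in it (hypothetical)."""
    name: str
    cues: List[str]   # textual cues standing in for visual evidence
    guidance: str     # what to do next once the state is recognized
    keyframes: List[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    """A textual procedure coupled with state cards and keyframes."""
    name: str
    procedure: List[str]
    state_cards: List[StateCard] = field(default_factory=list)


def select_state_cards(skill: MMSkill, observed_cues: List[str]) -> List[StateCard]:
    """Return the cards that share at least one cue with the live observation.

    Assumption: cue overlap replaces the real image-to-environment alignment.
    """
    observed = {c.lower() for c in observed_cues}
    return [
        card for card in skill.state_cards
        if observed & {c.lower() for c in card.cues}
    ]


def distill_guidance(skill: MMSkill, observed_cues: List[str]) -> str:
    """Condense matching cards into short text guidance for the main agent."""
    cards = select_state_cards(skill, observed_cues)
    if not cards:
        return f"[{skill.name}] no matching state; follow the procedure from step 1."
    return "\n".join(f"[{skill.name}] {c.name}: {c.guidance}" for c in cards)


# Usage with a made-up GUI skill.
skill = MMSkill(
    name="export_pdf",
    procedure=["Open the File menu", "Choose Export", "Confirm the dialog"],
    state_cards=[
        StateCard("menu_open", ["file menu visible"], "Click 'Export'."),
        StateCard("dialog_open", ["export dialog visible"], "Click 'OK'."),
    ],
)
print(distill_guidance(skill, ["Export dialog visible"]))
```

The point of the sketch is the separation of concerns: only the distilled text reaches the main agent, while the state cards and keyframes stay inside the lookup step, which mirrors the paper's stated goal of limiting image context.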

Experimental Results

To evaluate the effectiveness of MMSkills, the researchers conducted experiments on GUI and game-based visual-agent benchmarks. The results showed that MMSkills consistently improve both frontier and smaller multimodal agents. This suggests that external multimodal procedural knowledge complements model-internal priors: an agent that consults MMSkills can draw on recorded procedures in addition to what the underlying model already knows.

Conclusion

In conclusion, this research paper highlights the importance of incorporating multimodal procedural knowledge into agent capabilities. MMSkills provides a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making, and the reported experiments on GUI and game-based benchmarks show that it improves both frontier and smaller multimodal agents by supplying external knowledge. The evaluation covers GUI and game environments only; whether the approach carries over to other settings, such as robotics, remains to be shown.

Created on 19 May 2026
