JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

AI-generated keywords: JARVIS-1 Multimodal Planning Minecraft Self-Checking

AI-generated Key Points

  • JARVIS-1 is an open-world agent developed for the Minecraft universe
  • It can perceive multimodal input, generate plans, and perform control tasks
  • Builds on pre-trained multimodal language models that map visual observations and textual instructions to plans
  • Equipped with a multimodal memory combining pre-trained knowledge and game survival experiences
  • Capable of completing over 200 different tasks in Minecraft, ranging from short-horizon to long-horizon tasks
  • Is about five times more reliable than existing state-of-the-art agents on the long-horizon ObtainDiamondPickaxe task
  • Demonstrates superior generalization and planning abilities
  • Does not rely on fine-tuning through imitation learning or reinforcement learning
  • Maintains a high success rate even as the duration of the task increases
  • Incorporates self-checking mechanisms to ensure plan correctness
  • Can dynamically re-plan based on current inventory and game conditions

Authors: Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang

arXiv admin note: text overlap with arXiv:2206.11795 by other authors
License: CC BY 4.0

Abstract: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the existing most general agent in Minecraft, capable of completing over 200 different tasks using control and observation space similar to humans. These tasks range from short-horizon tasks, e.g., "chopping trees" to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well in short-horizon tasks, achieving nearly perfect performance. In the classic long-term task of $\texttt{ObtainDiamondPickaxe}$, JARVIS-1 surpasses the reliability of current state-of-the-art agents by 5 times and can successfully complete longer-horizon and more challenging tasks. The project page is available at https://craftjarvis-jarvis1.github.io.
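The pipeline described in the abstract (retrieve relevant experience from memory, ask a planner for a plan, dispatch the plan to a goal-conditioned controller, and store successful experience) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `ToyMemory`, `MemoryEntry`, `stub_planner`, and `run_task` are hypothetical names, the scene is a plain text label standing in for a visual observation, retrieval uses word overlap as a stand-in for whatever similarity measure the paper uses, and the planner stub takes the place of the pre-trained multimodal language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MemoryEntry:
    task: str        # textual instruction that was attempted
    scene: str       # stand-in for a visual observation, e.g. a biome label
    plan: List[str]  # sub-goal sequence that succeeded


@dataclass
class ToyMemory:
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def retrieve(self, task: str, scene: str, k: int = 1) -> List[MemoryEntry]:
        # Assumption: word overlap on the task plus a bonus for a matching
        # scene replaces a learned multimodal similarity.
        def score(entry: MemoryEntry) -> int:
            shared = set(task.lower().split()) & set(entry.task.lower().split())
            return len(shared) + (1 if entry.scene == scene else 0)

        ranked = sorted(self.entries, key=score, reverse=True)
        return [entry for entry in ranked[:k] if score(entry) > 0]


def stub_planner(task: str, scene: str, references: List[MemoryEntry]) -> List[str]:
    # Placeholder for the language model: reuse the closest remembered plan,
    # otherwise fall back to treating the task itself as a single sub-goal.
    if references:
        return list(references[0].plan)
    return [task]


def run_task(
    task: str,
    scene: str,
    memory: ToyMemory,
    planner: Callable[[str, str, List[MemoryEntry]], List[str]],
    controller: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    references = memory.retrieve(task, scene)
    plan = planner(task, scene, references)
    # Dispatch each sub-goal to the controller; stop at the first failure.
    succeeded = all(controller(sub_goal) for sub_goal in plan)
    if succeeded:
        memory.add(MemoryEntry(task, scene, plan))  # keep successful experience
    return plan, succeeded
```

A call such as `run_task("obtain wooden pickaxe", "forest", memory, stub_planner, controller)` would reuse a remembered plan when one matches and append the new experience to memory on success, which is the sense in which planning can improve as game time progresses.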

Submitted to arXiv on 10 Nov. 2023


Results of the summarizing process for the arXiv paper: 2311.05997v2

JARVIS-1 is an open-world agent developed for the Minecraft universe that can perceive multimodal input, generate plans, and perform control tasks. It uses pre-trained multimodal language models to map visual observations and textual instructions to plans, which are then executed by goal-conditioned controllers. The agent is equipped with a multimodal memory that combines pre-trained knowledge with its own game survival experiences. JARVIS-1 can complete over 200 different tasks in Minecraft, ranging from short-horizon to long-horizon tasks. On the long-horizon ObtainDiamondPickaxe task, it is about five times more reliable than existing state-of-the-art agents, and it demonstrates superior generalization and planning abilities. The method does not rely on fine-tuning through imitation learning or reinforcement learning. Unlike other approaches, which struggle with longer time frames, it maintains a high success rate even as task duration increases. JARVIS-1 also incorporates self-checking mechanisms to ensure plan correctness and can dynamically re-plan based on its current inventory and game conditions.
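The idea of self-checking a plan and re-planning from the current inventory can be illustrated with a small symbolic sketch. The summary does not say how the check is implemented, so everything here is an assumption: `RECIPES` and `YIELD` are a hypothetical, simplified recipe table (no crafting table or tools), `self_check` simulates the plan against an inventory, and `replan` repairs it by inserting the missing prerequisite before the failing step.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical, simplified recipes: item -> materials consumed per craft.
RECIPES: Dict[str, Dict[str, int]] = {
    "log": {},
    "planks": {"log": 1},
    "stick": {"planks": 2},
    "wooden_pickaxe": {"planks": 3, "stick": 2},
}
# Units produced per craft (or per gathering action for raw items).
YIELD: Dict[str, int] = {"log": 1, "planks": 4, "stick": 4, "wooden_pickaxe": 1}


def self_check(plan: List[str], inventory: Dict[str, int]) -> Optional[Tuple[int, str]]:
    """Simulate the plan; return (step index, missing item) or None if feasible."""
    inv = dict(inventory)  # work on a copy so the caller's inventory is untouched
    for index, item in enumerate(plan):
        for material, needed in RECIPES[item].items():
            if inv.get(material, 0) < needed:
                return index, material
        for material, needed in RECIPES[item].items():
            inv[material] -= needed
        inv[item] = inv.get(item, 0) + YIELD[item]
    return None


def replan(plan: List[str], inventory: Dict[str, int], max_fixes: int = 20) -> List[str]:
    """Insert missing prerequisites until the simulated plan succeeds."""
    repaired = list(plan)
    for _ in range(max_fixes):
        failure = self_check(repaired, inventory)
        if failure is None:
            return repaired
        index, missing = failure
        repaired.insert(index, missing)  # obtain the missing item first
    raise RuntimeError("could not repair the plan within max_fixes")


print(replan(["wooden_pickaxe"], {}))
# ['log', 'planks', 'stick', 'log', 'planks', 'wooden_pickaxe']
print(replan(["wooden_pickaxe"], {"log": 2}))
# ['planks', 'stick', 'planks', 'wooden_pickaxe']
```

The same goal yields different plans for different inventories: an empty inventory requires gathering logs twice, while an agent already holding two logs can skip straight to crafting.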
Created on 24 Nov. 2023
