VideoPoet: A Large Language Model for Zero-Shot Video Generation

AI-generated keywords: VideoPoet, Transformer, Pretraining, Multimodal, Generative

AI-generated Key Points

  • VideoPoet is a language model that generates high-quality videos with matching audio using different types of input signals.
  • It utilizes a decoder-only transformer architecture to process multimodal inputs such as images, videos, text, and audio.
  • The model undergoes two stages of training: pretraining and task-specific adaptation.
  • During pretraining, VideoPoet incorporates generative objectives within an autoregressive Transformer framework.
  • VideoPoet demonstrates state-of-the-art capabilities in zero-shot video generation by producing temporally coherent videos with dynamic and meaningful motion.
  • Despite only seeing a short temporal context, VideoPoet can predict future frames while maintaining consistency in object motion, style, and identity across more than 8 seconds of video output.
  • Training on videos, images, and text enables VideoPoet to understand various aspects of the world including 3D structures, camera motions, and visual styles learned from different sources.
  • Without specific training data or losses for encouraging 3D consistency, VideoPoet can accurately rotate around objects and visualize their backside.
  • Using short text prompts, VideoPoet can apply a range of camera motions and incorporate different visual styles into its generated videos.
  • The ability to combine different styles highlights VideoPoet's understanding of objects in a temporal context.
  • VideoPoet showcases the potential of large language models trained on discrete visual and audio tokens for generating high-quality videos.
  • Its strength lies in generating high-fidelity, large, and complex motions.
  • The unified architecture and vocabulary used during training allow the pretrained model to excel at multi-task video creation and serve as a foundation for various video-related capabilities like editing.

Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang

Project page: http://sites.research.google/videopoet/
License: CC BY 4.0

Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
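
The autoregressive, decoder-only generation described in the abstract can be sketched as a simple loop: conditioning tokens form a prefix, and output tokens are sampled one at a time, each conditioned on everything before it. This is a minimal illustration, not the authors' implementation. The names `generate` and `toy_model`, the vocabulary size of 8, and the toy next-token rule are all assumptions made for the example; a real system would replace `toy_model` with a trained transformer.

```python
import random
from typing import Callable, List, Sequence


def generate(
    next_token_probs: Callable[[Sequence[int]], List[float]],
    prefix: Sequence[int],
    num_new_tokens: int,
    rng: random.Random,
) -> List[int]:
    """Sample tokens one at a time, each conditioned on all earlier tokens."""
    tokens = list(prefix)
    for _ in range(num_new_tokens):
        probs = next_token_probs(tokens)
        # Draw the next token id from the predicted distribution.
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)
    return tokens


def toy_model(tokens: Sequence[int], vocab_size: int = 8) -> List[float]:
    """Hypothetical stand-in for a trained decoder-only transformer.

    It puts all probability mass on (last token + 1) modulo vocab_size,
    which makes the sampling loop deterministic and easy to follow.
    """
    probs = [0.0] * vocab_size
    probs[(tokens[-1] + 1) % vocab_size] = 1.0
    return probs


if __name__ == "__main__":
    # The prefix plays the role of the conditioning signal (e.g. text tokens).
    print(generate(toy_model, prefix=[0, 1, 2], num_new_tokens=4, rng=random.Random(0)))
```

In this sketch the conditioning signal and the generated output share one token sequence, which is the property that lets a single decoder-only model handle many input and output combinations.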

Submitted to arXiv on 21 Dec. 2023


Results of the summarizing process for the arXiv paper: 2312.14125v1

VideoPoet is a language model that can generate high-quality videos with matching audio from various types of input signals. It uses a decoder-only transformer architecture to process multimodal inputs such as images, videos, text, and audio. The model undergoes two stages of training: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of generative objectives within an autoregressive Transformer framework. This pretrained model serves as a foundation that can be adapted to different video generation tasks, and it demonstrates state-of-the-art capabilities in zero-shot video generation by producing temporally coherent videos with dynamic and meaningful motion.

Despite only being able to view a short temporal context, VideoPoet can predict future frames while maintaining the consistency of object motion, style, and identity across more than 8 seconds of video output. Training on videos, images, and text enables it to understand various aspects of the world, including 3D structures, camera motions, and visual styles learned from different sources. Even without specific training data or losses for encouraging 3D consistency, the model can rotate around objects and visualize their backside accurately. Additionally, given short text prompts, VideoPoet can apply a range of camera motions to image-to-video and text-to-video generations. The model can also incorporate different visual styles, such as watercolor or oil painting, into its generated videos; this stylization ability comes primarily from the text-image training data. The ability to combine these different styles highlights VideoPoet's understanding of objects in a temporal context.

In conclusion, VideoPoet showcases the potential of large language models trained on discrete visual and audio tokens for generating high-quality videos. Its strength lies in generating high-fidelity, large, and complex motions. The unified architecture and vocabulary used during training allow the pretrained model to excel at multi-task video creation and to serve as a foundation for various video-related capabilities such as editing.
Created on 22 Dec. 2023
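
One common way to realize a "unified vocabulary" over discrete text, visual, and audio tokens is to give each modality its own contiguous range of ids in a single shared id space. The sketch below illustrates that idea only; the summary does not state how VideoPoet lays out its vocabulary, so the offset scheme, the codebook sizes, and the helper names `build_offsets`, `to_shared`, and `from_shared` are all assumptions.

```python
from typing import Dict, Tuple

# Hypothetical per-modality codebook sizes, chosen only for illustration.
CODEBOOK_SIZES: Dict[str, int] = {"text": 100, "visual": 1000, "audio": 500}


def build_offsets(sizes: Dict[str, int]) -> Dict[str, int]:
    """Assign each modality a contiguous id range in one shared vocabulary."""
    offsets: Dict[str, int] = {}
    start = 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets


def to_shared(modality: str, local_id: int,
              sizes: Dict[str, int], offsets: Dict[str, int]) -> int:
    """Map a modality-local token id to its id in the shared vocabulary."""
    if not 0 <= local_id < sizes[modality]:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return offsets[modality] + local_id


def from_shared(shared_id: int,
                sizes: Dict[str, int], offsets: Dict[str, int]) -> Tuple[str, int]:
    """Recover (modality, local id) from a shared-vocabulary id."""
    for name, start in offsets.items():
        if start <= shared_id < start + sizes[name]:
            return name, shared_id - start
    raise ValueError(f"{shared_id} is outside the shared vocabulary")


if __name__ == "__main__":
    offsets = build_offsets(CODEBOOK_SIZES)
    shared = to_shared("visual", 5, CODEBOOK_SIZES, offsets)
    print(shared, from_shared(shared, CODEBOOK_SIZES, offsets))
```

With a layout like this, a single token sequence can interleave modalities, and the model's output ids can be routed back to the appropriate decoder by range.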
