VideoPoet: A Large Language Model for Zero-Shot Video Generation
AI-generated Key Points
- VideoPoet is a language model that generates high-quality videos with matching audio from a large variety of conditioning signals.
- It utilizes a decoder-only transformer architecture to process multimodal inputs such as images, videos, text, and audio.
- The model undergoes two stages of training: pretraining and task-specific adaptation.
- During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework.
- VideoPoet demonstrates state-of-the-art capabilities in zero-shot video generation by producing temporally coherent videos with dynamic and meaningful motion.
- Despite only seeing a short temporal context, VideoPoet can predict future frames while maintaining consistency in object motion, style, and identity across more than 8 seconds of video output.
- Training on videos, images, and text enables VideoPoet to understand various aspects of the world including 3D structures, camera motions, and visual styles learned from different sources.
- Without specific training data or losses for encouraging 3D consistency, VideoPoet can rotate around objects and render their back sides.
- Using short text prompts, VideoPoet can apply a range of camera motions and incorporate different visual styles into its generated videos.
- The ability to combine different styles highlights VideoPoet's understanding of objects in a temporal context.
- VideoPoet showcases the potential of large language models trained on discrete visual and audio tokens for generating high-quality videos.
- Its strength lies in generating high-fidelity motion in large and complex videos.
- The unified architecture and vocabulary used during training allow the pretrained model to excel at multi-task video creation and serve as a foundation for various video-related capabilities like editing.
Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang
Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
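The idea of a decoder-only model that reads and writes several modalities as one stream of discrete tokens can be illustrated with a small sketch. This is not the authors' implementation: the codebook sizes, the function names (`to_unified`, `from_unified`, `generate`, `toy_next_token`), and the offset-based layout of the shared vocabulary are all assumptions made for illustration, and a trivial arithmetic rule stands in for the trained Transformer.

```python
# Hypothetical codebook sizes; the summary does not state the real ones.
VOCAB_SIZES = {"text": 100, "visual": 200, "audio": 50}


def build_offsets(sizes):
    """Give each modality a contiguous id range inside one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)


def to_unified(modality, local_id):
    """Map a modality-local token id to an id in the shared vocabulary."""
    if not 0 <= local_id < VOCAB_SIZES[modality]:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return OFFSETS[modality] + local_id


def from_unified(token_id):
    """Recover (modality, local_id) from a shared-vocabulary id."""
    for name, start in OFFSETS.items():
        if start <= token_id < start + VOCAB_SIZES[name]:
            return name, token_id - start
    raise ValueError(f"{token_id} is outside the shared vocabulary")


def generate(prefix, next_token_fn, num_new_tokens):
    """Autoregressive loop: each new token is chosen from all earlier tokens."""
    seq = list(prefix)
    for _ in range(num_new_tokens):
        seq.append(next_token_fn(seq))
    return seq


def toy_next_token(seq):
    """Stand-in for a trained model: always emits some visual token."""
    local_id = sum(seq) % VOCAB_SIZES["visual"]
    return to_unified("visual", local_id)


# A text-token prefix conditions the generation of visual tokens.
prompt = [to_unified("text", 5)]
output = generate(prompt, toy_next_token, num_new_tokens=3)
print([from_unified(t) for t in output])
```

In a real system the conditioning prefix would come from tokenizers for text, images, video, or audio, and the generated ids would be decoded back into pixels or waveforms; the sketch only shows how one id space and one next-token loop can cover all of them.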