TokenFlow: Consistent Diffusion Features for Consistent Video Editing

AI-generated keywords: Generative AI

AI-generated Key Points

  • Generative AI has advanced in image generation but lags behind in video generation
  • The proposed framework uses a pre-trained text-to-image model for text-driven editing of natural videos
  • The main challenge is ensuring consistency across all frames of the edited video
  • The framework enforces the original inter-frame video correspondences on the edit to achieve consistency
  • The key contribution is TokenFlow, a technique for enhancing temporal consistency in generated videos

Authors: Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel

License: CC BY 4.0

Abstract: The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/

Submitted to arXiv on 19 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.10373v1

The field of generative AI has made significant advancements in recent years, particularly in the realm of image generation. However, the transition to video generation has been slower, with current state-of-the-art video models still lagging behind their image counterparts in terms of visual quality and user control over generated content.

In this work, a framework leveraging a pre-trained text-to-image model is proposed for text-driven editing of natural videos. The goal is to generate high-quality videos that align with a specified edit described by an input text prompt while maintaining the spatial layout and motion of the original video. The main challenge lies in ensuring consistency across all frames of the edited video, where each point in the 3D world undergoes coherent modifications over time.

To address this challenge, the framework enforces original inter-frame video correspondences on the edit. By recognizing that natural videos contain redundant information across frames and that internal representations exhibit similar properties within diffusion models, consistency can be achieved by ensuring that edited features convey the same inter-frame correspondences as the original video features. This approach allows for consistent edits without additional training or fine-tuning and can be used alongside existing diffusion-based image editing methods.

Key contributions include TokenFlow, a technique that enhances temporal consistency in videos generated by a text-to-image diffusion model through semantic correspondences of diffusion features across frames. Additionally, novel empirical analysis explores the properties of diffusion features across videos, leading to state-of-the-art editing results showcasing complex motions.

In summary, this work presents a novel approach to text-driven video editing using a text-to-image diffusion model. By enforcing semantic correspondences of diffusion features across frames, temporal consistency is significantly improved in generated videos. The framework offers a valuable tool for professionals and semi-professionals seeking to create high-quality edited videos while preserving original spatial layout and motion dynamics.
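
The idea of enforcing the source video's inter-frame correspondences on the edited features can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `nearest_neighbor_field` and `propagate_features` are hypothetical, the use of cosine similarity and of a single keyframe are simplifying assumptions, and real features would come from the layers of a diffusion model rather than from toy arrays.

```python
import numpy as np


def nearest_neighbor_field(src_frame_feats, src_key_feats, eps=1e-8):
    """For each token of a source frame, return the index of the most
    similar token in the source keyframe (cosine similarity).

    Both inputs have shape (num_tokens, feature_dim). Assumption: cosine
    similarity is used as the matching criterion.
    """
    a = src_frame_feats / (np.linalg.norm(src_frame_feats, axis=1, keepdims=True) + eps)
    b = src_key_feats / (np.linalg.norm(src_key_feats, axis=1, keepdims=True) + eps)
    similarity = a @ b.T  # shape: (frame_tokens, key_tokens)
    return similarity.argmax(axis=1)


def propagate_features(edited_key_feats, nn_field):
    """Build edited features for a frame by copying, for every token, the
    edited keyframe feature of its matched source token."""
    return edited_key_feats[nn_field]


# Toy usage: the frame is a shuffled, slightly perturbed copy of the keyframe.
source_key = np.eye(4)                      # 4 tokens, 4-dim features
shuffle = np.array([2, 0, 3, 1])
source_frame = source_key[shuffle] + 0.05   # same content, different positions

edited_key = np.arange(8, dtype=float).reshape(4, 2)  # stand-in for edited features
field = nearest_neighbor_field(source_frame, source_key)
edited_frame = propagate_features(edited_key, field)
```

Because the correspondence field is computed from the source features only, the edited frame inherits the same token-to-token matches as the original video, which is the property the summary above describes.
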
Created on 01 Mar. 2024
