FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He
  1Leibniz University Hannover   2Meta AI   3The University of Hong Kong   4Nanyang Technological University

Abstract

Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating the 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce irrelevant information for each patch and thereby cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model’s U-Net to address the inconsistency issue in text-to-video editing. Our method, FLATTEN, enforces patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency. Experimental results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance. In particular, our method excels in maintaining visual consistency in the edited videos.

Method

In this work, we propose a novel (optical) FLow-guided ATTENtion that seamlessly integrates with text-to-image diffusion models and implicitly leverages optical flow for text-to-video editing, addressing the visual consistency limitation of previous works. FLATTEN enforces patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency of the edited video. The main advantage of our method is that it enables information to be communicated accurately across multiple frames, guided by optical flow, which stabilizes the prompt-generated visual content of the edited videos. Furthermore, our proposed method can be easily integrated into other diffusion-based text-to-video editing methods to improve the visual consistency of their edited videos. We present a T2V editing framework built on FLATTEN as its foundation. Our training-free approach achieves high-quality and highly consistent text-to-video editing.
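The core idea, letting patches on the same optical-flow trajectory attend to one another, can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration rather than the paper's implementation: it assumes patch features of shape (frames, patches, dim) and precomputed trajectories that record, for each flow path, the patch index it occupies in every frame; the names `flow_guided_attention` and `trajectories` are ours, not the released code's API.

```python
# Minimal sketch of flow-guided attention (illustrative, not the official code).
import torch
import torch.nn.functional as F

def flow_guided_attention(features: torch.Tensor, trajectories: torch.Tensor) -> torch.Tensor:
    """Let patches on the same optical-flow path attend to each other.

    features:      (T, N, D) patch embeddings for T frames with N patches each.
    trajectories:  (P, T) long tensor; trajectories[p, t] is the patch index
                   that flow path p occupies in frame t.
    Returns features of the same shape, updated along each flow path.
    """
    T, N, D = features.shape
    P = trajectories.shape[0]

    # Gather the T patch embeddings that lie on each flow path: (P, T, D).
    frame_idx = torch.arange(T, device=features.device).unsqueeze(0).expand(P, T)
    path_feats = features[frame_idx, trajectories]

    # Self-attention restricted to patches on the same path (queries, keys and
    # values all come from the trajectory), so temporal context stays on-path.
    attended = F.scaled_dot_product_attention(path_feats, path_feats, path_feats)

    # Scatter the attended features back to their frame/patch slots.
    # If several paths were to share a slot in some frame, the last write wins
    # in this simplified sketch.
    out = features.clone()
    out[frame_idx, trajectories] = attended
    return out
```

In the full framework, the flow trajectories are estimated from the source video, and this kind of trajectory-restricted attention is applied inside the denoising U-Net's attention modules, so that temporal information is exchanged along flow paths rather than densely across all patches.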

Qualitative Results

Source video "A cat, Van Gogh style." "A woolen toy cat." "Pixar animation." "A tiger"
Source video "A metal sculpture." "A husky." "A dirty tiger." "A cute pig"
Source video "Pointillism painting, detailed." Source video "A detailed woolen toy cat."
Source video "A car drifts on a snowy road." Source video "Cartoon Style."

Plug-and-Play FLATTEN

[Comparison of a source video edited with ControlVideo and with ControlVideo+FLATTEN.]

Comparison with other T2V editing methods

Editing Prompt: "Wooden trucks drive on a racetrack."
[Edited videos from the source video produced by our method, Tune-A-Video, FateZero, Text2Video-Zero, ControlVideo, and TokenFlow.]

BibTeX

@article{cong2023flatten,
  title={FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing},
  author={Cong, Yuren and Xu, Mengmeng and Simon, Christian and Chen, Shoufa and Ren, Jiawei and Xie, Yanping and Perez-Rua, Juan-Manuel and Rosenhahn, Bodo and Xiang, Tao and He, Sen},
  journal={arXiv preprint arXiv:2310.05922},
  year={2023}
}