Any-to-Bokeh: One-Step Video Bokeh via Multi-Plane Image Guided Diffusion

1Zhejiang University, 2vivo Mobile Communication Co., Ltd

📖TL;DR: Any-to-Bokeh is a novel one-step video bokeh framework that converts arbitrary input videos into temporally coherent, depth-aware bokeh effects.


  1. Bokeh rendering result: the plane of focus is placed on the subject.
  2. Customize the focal plane: focus on any depth plane in a video.
  3. Customize the blur strength: set any defocus blur intensity for a video (illustrated in the sketch after this list).
  4. * indicates that the original video is a synthetic video generated by a video generation model such as Wanx.
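Both controls reduce to the standard defocus relation: the blur at a pixel grows with its disparity distance from the chosen focal plane, scaled by the user-set strength \(K\). The snippet below is only a minimal illustration of the two parameters, not the paper's renderer.

```python
import numpy as np

def defocus_radius(disparity, focus_disparity, blur_strength):
    """Per-pixel blur radius for a chosen focal plane.

    Thin-lens style approximation: blur grows with the disparity distance
    from the focal plane, scaled by the user-set strength K. This only
    illustrates the two controls above; it is not the paper's renderer.
    """
    return blur_strength * np.abs(disparity - focus_disparity)

disparity_map = np.random.rand(4, 4)  # placeholder disparity map in [0, 1]
print(defocus_radius(disparity_map, focus_disparity=0.5, blur_strength=24.0))
```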

Abstract

Recent advances in diffusion-based editing models have enabled realistic camera simulation and image-based bokeh, but video bokeh remains largely unexplored. Existing video editing models cannot explicitly control the focal plane or adjust bokeh intensity, limiting their applicability to controllable optical effects. Moreover, naively extending image-based bokeh methods to video often causes temporal flickering and unsatisfactory edge-blur transitions due to the lack of temporal modeling and generalization capability. To address these challenges, we propose a novel one-step video bokeh framework that converts arbitrary input videos into temporally coherent, depth-aware bokeh effects. Our method leverages a multi-plane image (MPI) representation constructed through a progressively widening depth sampling function, providing explicit geometric guidance for depth-dependent blur synthesis. By conditioning a single-step video diffusion model on MPI layers and exploiting the strong 3D priors of pre-trained models such as Stable Video Diffusion, our approach achieves realistic and consistent bokeh effects across diverse scenes. Additionally, we introduce a progressive training strategy that enhances temporal consistency, depth robustness, and detail preservation. Extensive experiments demonstrate that our method produces high-quality, controllable bokeh effects and achieves state-of-the-art performance on multiple evaluation benchmarks.
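For concreteness, the snippet below sketches one plausible form of a progressively widening depth sampling for MPI construction: plane boundaries packed densely around the focal plane and spaced geometrically wider farther from it. The function name, the growth factor, and the disparity range are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

def focal_centered_planes(focus_disp, n_planes=8, base_step=0.02, growth=1.6,
                          d_min=0.0, d_max=1.0):
    """Sample MPI plane boundaries in disparity space (illustrative only).

    Boundaries are dense near the focal plane and widen progressively away
    from it, so depth-dependent blur is resolved finely around the focus.
    The geometric widening (base_step * growth**i) is an assumed schedule.
    """
    offsets = np.cumsum(base_step * growth ** np.arange(n_planes))
    bounds = np.concatenate([focus_disp - offsets[::-1], [focus_disp],
                             focus_disp + offsets])
    return np.clip(bounds, d_min, d_max)

print(focal_centered_planes(focus_disp=0.4))
```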

Method

1. Two key components of Any-to-Bokeh:

a) One-step video bokeh model architecture: takes an arbitrary input video together with the disparity relative to the focal plane and renders the bokeh effect in a single step.

b) MPI spatial block: uses the MPI mask \(\mathcal{M}\) to prompt MPI attention to attend to regions at different depths from the focal plane, guiding the bokeh rendering (see the sketch below). In addition, high-level semantic information is injected via cross-attention to preserve semantic structure, and the user-defined blur strength \(K\) is injected through an embedding.
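The exact implementation of this block is not given on this page, so the PyTorch sketch below is only one plausible reading of the description above: attention restricted by the per-plane MPI mask \(\mathcal{M}\), semantic injection via cross-attention, and the blur strength \(K\) added as a learned embedding. All module names, shapes, and defaults are assumptions.

```python
import torch
import torch.nn as nn

class MPISpatialBlock(nn.Module):
    """Illustrative sketch of an MPI spatial block (names and shapes assumed).

    - MPI attention: keys outside the current MPI plane are masked out, so
      each plane attends only within its own depth band from the focal plane.
    - Cross-attention injects high-level semantic tokens.
    - The scalar blur strength K is injected as a learned embedding.
    """
    def __init__(self, dim=320, heads=8, sem_dim=768):
        super().__init__()
        self.mpi_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=sem_dim,
                                                vdim=sem_dim, batch_first=True)
        self.k_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, mpi_mask, sem_tokens, blur_k):
        # x: (B, N, C) spatial tokens; mpi_mask: (B, N) bool, True = token in plane;
        # sem_tokens: (B, S, sem_dim); blur_k: (B, 1) user-defined blur strength.
        x = x + self.k_embed(blur_k).unsqueeze(1)          # inject blur strength K
        attn_out, _ = self.mpi_attn(self.norm1(x), x, x,
                                    key_padding_mask=~mpi_mask)  # ignore out-of-plane keys
        x = x + attn_out
        cross_out, _ = self.cross_attn(self.norm2(x), sem_tokens, sem_tokens)
        return x + cross_out                               # inject semantic structure
```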
2. Progressive Training Strategy:

We adopt a three-stage training strategy to improve temporal consistency, depth robustness, and fine-detail preservation; a minimal sketch of the stage schedule follows the list.
  • Stage 1: Train the whole U-Net and adapters.
  • Stage 2: Refine the temporal blocks with input disturbance.
  • Stage 3: Fine-tune the VAE decoder.
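As a rough sketch of how these stages could be wired as trainable-parameter groups (the attribute names and the substring match for temporal blocks are placeholders, not the released code):

```python
def set_trainable_for_stage(model, stage):
    # Placeholder attribute names (model.unet, model.adapters, model.vae.decoder)
    # and the "temporal" name match are assumptions, not the released code.
    for p in model.parameters():
        p.requires_grad = False
    if stage == 1:       # Stage 1: train the whole U-Net and the adapters.
        for p in model.unet.parameters():
            p.requires_grad = True
        for p in model.adapters.parameters():
            p.requires_grad = True
    elif stage == 2:     # Stage 2: refine only the temporal blocks (inputs perturbed elsewhere).
        for name, p in model.unet.named_parameters():
            if "temporal" in name:
                p.requires_grad = True
    elif stage == 3:     # Stage 3: fine-tune only the VAE decoder.
        for p in model.vae.decoder.parameters():
            p.requires_grad = True
```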

BibTeX

@misc{yang2025anytobokehonestepvideobokeh,
      title={Any-to-Bokeh: One-Step Video Bokeh via Multi-Plane Image Guided Diffusion}, 
      author={Yang Yang and Siming Zheng and Jinwei Chen and Boxi Wu and Xiaofei He and Deng Cai and Bo Li and Peng-Tao Jiang},
      year={2025},
      eprint={2505.21593},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.21593}, 
    }