{"ID":2838319,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.18131","arxiv_id":"2511.18131","title":"Video4Edit: Viewing Image Editing as a Degenerate Temporal Process","abstract":"We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.","short_abstract":"We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. 
Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high...","url_abs":"https://arxiv.org/abs/2511.18131","url_pdf":"https://arxiv.org/pdf/2511.18131v1","authors":"[\"Xiaofan Li\",\"Yanpeng Sun\",\"Chenming Wu\",\"Fan Duan\",\"YuAn Wang\",\"Weihao Bo\",\"Yumeng Zhang\",\"Dingkang Liang\"]","published":"2025-11-22T17:30:55Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\"]","has_code":false}