{"ID":2829095,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13609","arxiv_id":"2512.13609","title":"Do-Undo Bench: Reversibility for Action Understanding in Image Generation","abstract":"We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.","short_abstract":"We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our tra...","url_abs":"https://arxiv.org/abs/2512.13609","url_pdf":"https://arxiv.org/pdf/2512.13609v2","authors":"[\"Shweta Mahajan\",\"Shreya Kadambi\",\"Hoang Le\",\"Rajeev Yasarla\",\"Apratim Bhattacharyya\",\"Munawar Hayat\",\"Fatih Porikli\"]","published":"2025-12-15T18:03:42Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}