{"ID":2828360,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14099","arxiv_id":"2512.14099","title":"ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models","abstract":"Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view synthesis as a discrete sequence modeling problem where each viewpoint is represented as visual tokens from MAGVIT-v2. Through masked token prediction, our approach enables progressive multi-view generation via iterative token unmasking, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics and improving IoU by 10.6% on 3D-FUTURE. This validates discrete diffusion as a promising candidate for multi-view generation.","short_abstract":"Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view synthesis as a discrete sequence modeling problem where each viewpoint is represented as visual token...","url_abs":"https://arxiv.org/abs/2512.14099","url_pdf":"https://arxiv.org/pdf/2512.14099v2","authors":"[\"Ruishu Zhu\",\"Zhihao Huang\",\"Jiacheng Sun\",\"Ping Luo\",\"Hongyuan Zhang\",\"Xuelong Li\"]","published":"2025-12-16T05:15:07Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\"]","has_code":false}