{"ID":2832064,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06905","arxiv_id":"2512.06905","title":"Scaling Zero-Shot Reference-to-Video Generation","abstract":"Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.","short_abstract":"Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scal...","url_abs":"https://arxiv.org/abs/2512.06905","url_pdf":"https://arxiv.org/pdf/2512.06905v1","authors":"[\"Zijian Zhou\",\"Shikun Liu\",\"Haozhe Liu\",\"Haonan Qiu\",\"Zhaochong An\",\"Weiming Ren\",\"Zhiheng Liu\",\"Xiaoke Huang\",\"Kam Woh Ng\",\"Tian Xie\",\"Xiao Han\",\"Yuren Cong\",\"Hang Li\",\"Chuyan Zhu\",\"Aditya Patel\",\"Tao Xiang\",\"Sen He\"]","published":"2025-12-07T16:10:25Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}