{"ID":2877329,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.21070","arxiv_id":"2508.21070","title":"Dress\u0026Dance: Dress up and Dance as You Like It - Technical Preview","abstract":"We present Dress\u0026Dance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. Dress\u0026Dance outperforms existing open source and commercial solutions and enables a high quality and flexible try-on experience.","short_abstract":"We present Dress\u0026Dance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, a...","url_abs":"https://arxiv.org/abs/2508.21070","url_pdf":"https://arxiv.org/pdf/2508.21070v1","authors":"[\"Jun-Kun Chen\",\"Aayush Bansal\",\"Minh Phuoc Vo\",\"Yu-Xiong Wang\"]","published":"2025-08-28T17:59:55Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\"]","methods":"[\"Diffusion Model\"]","has_code":false}