{"ID":2839713,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.15831","arxiv_id":"2511.15831","title":"UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment","abstract":"Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.","short_abstract":"Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. 
Recent methods explore multi-task VTON frameworks guided by t...","url_abs":"https://arxiv.org/abs/2511.15831","url_pdf":"https://arxiv.org/pdf/2511.15831v2","authors":"[\"Wei Zhang\",\"Yeying Jin\",\"Xin Li\",\"Yan Zhang\",\"Xiaofeng Cong\",\"Cong Wang\",\"Fengcai Qiao\",\"Zhichao Lian\"]","published":"2025-11-19T19:38:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606900,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2839713,"paper_url":"https://arxiv.org/abs/2511.15831","paper_title":"UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment","repo_url":"https://github.com/zwplus/UniFit","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}