{"ID":2847172,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.00391","arxiv_id":"2511.00391","title":"VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning","abstract":"Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized \\textbf{VI}sio\\textbf{N} \\textbf{C}ode \\textbf{I}ntelligence. In this work, we introduce \\textbf{VinciCoder}, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on diverse multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, surpassing recent open-source models. The ablation study further validates the effectiveness of our proposed coarse-to-fine ViRL strategy. The data, code, and model are available at https://github.com/DocTron-hub/VinciCoder.","short_abstract":"Multimodal code generation has garnered significant interest within the research community. 
Despite the notable success of recent vision-language models (VLMs) on specialized tasks like chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of ge...","url_abs":"https://arxiv.org/abs/2511.00391","url_pdf":"https://arxiv.org/pdf/2511.00391v2","authors":"[\"Xuanle Zhao\",\"Deyang Jiang\",\"Zhixiong Zeng\",\"Lei Chen\",\"Haibo Qiu\",\"Jing Huang\",\"Yufeng Zhong\",\"Liming Zheng\",\"Yilin Cao\",\"Lin Ma\"]","published":"2025-11-01T04:05:26Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607490,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2847172,"paper_url":"https://arxiv.org/abs/2511.00391","paper_title":"VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning","repo_url":"https://github.com/DocTron-hub/VinciCoder","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
