{"ID":2874714,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.04548","arxiv_id":"2509.04548","title":"Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model","abstract":"Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement strategy (PDTR), which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement strategies, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly larger generation parameters, including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following the MetaQuery, we connect the UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to launch a unified multimodal model UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. This consistently validates the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.","short_abstract":"Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-K...","url_abs":"https://arxiv.org/abs/2509.04548","url_pdf":"https://arxiv.org/pdf/2509.04548v2","authors":"[\"Hongyang Wei\",\"Baixin Xu\",\"Hongbo Liu\",\"Size Wu\",\"Jie Liu\",\"Yi Peng\",\"Peiyu Wang\",\"Zexiang Liu\",\"Jingwen He\",\"Yidan Xietian\",\"Chuanxin Tang\",\"Zidong Wang\",\"Yichen Wei\",\"Liang Hu\",\"Boyi Jiang\",\"Wei Li\",\"Ying He\",\"Yang Liu\",\"Xuchen Song\",\"Yangguang Li\",\"Yahui Zhou\"]","published":"2025-09-04T17:00:17Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}
