{"ID":2824406,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23568","arxiv_id":"2512.23568","title":"ThinkGen: Generalized Thinking for Visual Generation","abstract":"Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen","short_abstract":"Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. \nIn...","url_abs":"https://arxiv.org/abs/2512.23568","url_pdf":"https://arxiv.org/pdf/2512.23568v1","authors":"[\"Siyu Jiao\",\"Yiheng Lin\",\"Yujie Zhong\",\"Qi She\",\"Wei Zhou\",\"Xiaohan Lan\",\"Zilong Huang\",\"Fei Yu\",\"Yingchen Yu\",\"Yunqing Zhao\",\"Yao Zhao\",\"Yunchao Wei\"]","published":"2025-12-29T16:08:50Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":605591,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2824406,"paper_url":"https://arxiv.org/abs/2512.23568","paper_title":"ThinkGen: Generalized Thinking for Visual Generation","repo_url":"https://github.com/jiaosiyuu/ThinkGen","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}