{"ID":2873575,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.06945","arxiv_id":"2509.06945","title":"Interleaving Reasoning for Better Text-to-Image Generation","abstract":"Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity.
The code, model weights, and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation .","short_abstract":"Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances...","url_abs":"https://arxiv.org/abs/2509.06945","url_pdf":"https://arxiv.org/pdf/2509.06945v2","authors":"[\"Wenxuan Huang\",\"Shuang Chen\",\"Zheyong Xie\",\"Shaosheng Cao\",\"Shixiang Tang\",\"Yufan Shen\",\"Qingyu Yin\",\"Wenbo Hu\",\"Xiaoman Wang\",\"Yuntian Tang\",\"Junbo Qiao\",\"Yue Guo\",\"Yao Hu\",\"Zhenfei Yin\",\"Philip Torr\",\"Yu Cheng\",\"Wanli Ouyang\",\"Shaohui Lin\"]","published":"2025-09-08T17:56:23Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":610067,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2873575,"paper_url":"https://arxiv.org/abs/2509.06945","paper_title":"Interleaving Reasoning for Better Text-to-Image Generation","repo_url":"https://github.com/Osilly/Interleaving-Reasoning-Generation","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
