{"ID":2856125,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.11026","arxiv_id":"2510.11026","title":"GIR-Bench: Versatile Benchmark for Generating Images with Reasoning","abstract":"Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design different task-specific evaluation pipelines tailored for each task. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems have shown that: Although unified models are more capable of reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. \nThe data and code for GIR-Bench are available at https://github.com/HKUST-LongGroup/GIR-Bench.","short_abstract":"Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between underst...","url_abs":"https://arxiv.org/abs/2510.11026","url_pdf":"https://arxiv.org/pdf/2510.11026v2","authors":"[\"Hongxiang Li\",\"Yaowei Li\",\"Bin Lin\",\"Yuwei Niu\",\"Yuhang Yang\",\"Xiaoshuang Huang\",\"Jiayin Cai\",\"Xiaolong Jiang\",\"Yao Hu\",\"Long Chen\"]","published":"2025-10-13T05:50:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608315,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2856125,"paper_url":"https://arxiv.org/abs/2510.11026","paper_title":"GIR-Bench: Versatile Benchmark for Generating Images with Reasoning","repo_url":"https://github.com/HKUST-LongGroup/GIR-Bench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}