{"ID":2869640,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.13642","arxiv_id":"2509.13642","title":"LLM-I: LLMs are Naturally Interleaved Multimodal Creators","abstract":"We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the \"one-tool\" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.","short_abstract":"We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the \"one-tool\" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or pro...","url_abs":"https://arxiv.org/abs/2509.13642","url_pdf":"https://arxiv.org/pdf/2509.13642v1","authors":"[\"Zirun Guo\",\"Feng Zhang\",\"Kai Jia\",\"Tao Jin\"]","published":"2025-09-17T02:33:29Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Large Language Model\"]","has_code":false,"code_links":[{"ID":609706,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2869640,"paper_url":"https://arxiv.org/abs/2509.13642","paper_title":"LLM-I: LLMs are Naturally Interleaved Multimodal Creators","repo_url":"https://github.com/ByteDance-BandAI/LLM-I","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
