{"ID":2862780,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26231","arxiv_id":"2509.26231","title":"IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance","abstract":"Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weight using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.","short_abstract":"Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weight using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of gen...","url_abs":"https://arxiv.org/abs/2509.26231","url_pdf":"https://arxiv.org/pdf/2509.26231v1","authors":"[\"Jiayi Guo\",\"Chuanhao Yan\",\"Xingqian Xu\",\"Yulin Wang\",\"Kai Wang\",\"Gao Huang\",\"Humphrey Shi\"]","published":"2025-09-30T13:27:03Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608944,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2862780,"paper_url":"https://arxiv.org/abs/2509.26231","paper_title":"IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance","repo_url":"https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}