{"ID":2887892,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.00477","arxiv_id":"2508.00477","title":"LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer","abstract":"In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.","short_abstract":"In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios...","url_abs":"https://arxiv.org/abs/2508.00477","url_pdf":"https://arxiv.org/pdf/2508.00477v2","authors":"[\"Yuzhuo Chen\",\"Zehua Ma\",\"Jianhua Wang\",\"Kai Kang\",\"Shunyu Yao\",\"Weiming Zhang\"]","published":"2025-08-01T09:51:54Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false,"code_links":[{"ID":611483,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2887892,"paper_url":"https://arxiv.org/abs/2508.00477","paper_title":"LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer","repo_url":"https://github.com/Suchenl/LAMIC","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}