{"ID":2830801,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.09824","arxiv_id":"2512.09824","title":"Composing Concepts from Images and Videos via Concept-prompt Binding","abstract":"Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind \u0026 Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.","short_abstract":"Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind \u0026 Compose, a one-shot method...","url_abs":"https://arxiv.org/abs/2512.09824","url_pdf":"https://arxiv.org/pdf/2512.09824v2","authors":"[\"Xianghao Kong\",\"Zeyu Zhang\",\"Yuwei Guo\",\"Zhuoran Zhao\",\"Songchun Zhang\",\"Anyi Rao\"]","published":"2025-12-10T16:57:31Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.MM\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}
