{"ID":2861808,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00438","arxiv_id":"2510.00438","title":"BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration","abstract":"Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.","short_abstract":"Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that s...","url_abs":"https://arxiv.org/abs/2510.00438","url_pdf":"https://arxiv.org/pdf/2510.00438v2","authors":"[\"Zhaoyang Li\",\"Dongjun Qian\",\"Kai Su\",\"Qishuai Diao\",\"Xiangyang Xia\",\"Chang Liu\",\"Wenfei Yang\",\"Tianzhu Zhang\",\"Zehuan Yuan\"]","published":"2025-10-01T02:41:11Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false}
