{"ID":2886086,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03034","arxiv_id":"2508.03034","title":"MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention","abstract":"Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further incorporate a Latent Video Perceptual Loss to enhance identity coherence and fine-grained details across video frames. To train this model, we collect CelebIPVid, a dataset of 10,000 high-resolution videos from 1,000 diverse individuals, promoting cross-ethnicity generalization. Extensive experiments on CelebIPVid show that MoCA outperforms existing T2V methods by over 5% across Face similarity.","short_abstract":"Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built...","url_abs":"https://arxiv.org/abs/2508.03034","url_pdf":"https://arxiv.org/pdf/2508.03034v2","authors":"[\"Qi Xie\",\"Yongjia Ma\",\"Donglin Di\",\"Xuehao Gao\",\"Xun Yang\"]","published":"2025-08-05T03:24:08Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}
