{"ID":2836361,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.21780","arxiv_id":"2511.21780","title":"3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation","abstract":"Text-to-video (T2V) diffusion models have recently achieved impressive visual quality, yet most systems still generate silent clips and treat audio as a secondary concern. Existing audio-video generation pipelines typically decompose the task into cascaded stages, which accumulate errors across modalities and are trained under separate objectives. Recent joint audio-video generators alleviate this issue but often rely on dual-tower architectures with ad-hoc cross-modal bridges and static, single-shot text conditioning, making it difficult to both reuse T2V backbones and to reason about how audio, video and language interact over time. To address these challenges, we propose 3MDiT, a unified tri-modal diffusion transformer for text-driven synchronized audio-video generation. Our framework models video, audio and text as jointly evolving streams: an isomorphic audio branch mirrors a T2V backbone, tri-modal omni-blocks perform feature-level fusion across the three modalities, and an optional dynamic text conditioning mechanism updates the text representation as audio and video evidence co-evolve. The design supports two regimes: training from scratch on audio-video data, and orthogonally adapting a pretrained T2V model without modifying its backbone. Experiments show that our approach generates high-quality videos and realistic audio while consistently improving audio-video synchronization and tri-modal alignment across a range of quantitative metrics.","short_abstract":"Text-to-video (T2V) diffusion models have recently achieved impressive visual quality, yet most systems still generate silent clips and treat audio as a secondary concern. Existing audio-video generation pipelines typically decompose the task into cascaded stages, which accumulate errors across modalities and are train...","url_abs":"https://arxiv.org/abs/2511.21780","url_pdf":"https://arxiv.org/pdf/2511.21780v1","authors":"[\"Yaoru Li\",\"Heyu Si\",\"Federico Landi\",\"Pilar Oplustil Gallegos\",\"Ioannis Koutsoumpas\",\"O. Ricardo Cortez Vazquez\",\"Ruiju Fu\",\"Qi Guo\",\"Xin Jin\",\"Shunyu Liu\",\"Mingli Song\"]","published":"2025-11-26T11:25:26Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.SD\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}