{"ID":2824630,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.22905","arxiv_id":"2512.22905","title":"JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation","abstract":"This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.","short_abstract":"This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a...","url_abs":"https://arxiv.org/abs/2512.22905","url_pdf":"https://arxiv.org/pdf/2512.22905v2","authors":"[\"Kai Liu\",\"Jungang Li\",\"Yuchong Sun\",\"Shengqiong Wu\",\"Jianzhang Gao\",\"Daoan Zhang\",\"Wei Zhang\",\"Sheng Jin\",\"Sicheng Yu\",\"Geng Zhan\",\"Jiayi Ji\",\"Fan Zhou\",\"Liang Zheng\",\"Shuicheng Yan\",\"Hao Fei\",\"Tat-Seng Chua\"]","published":"2025-12-28T12:25:43Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}