{"ID":2855314,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13747","arxiv_id":"2510.13747","title":"InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue","abstract":"We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to the much larger model like Qwen2.5-Omni-7B on general benchmarks, and it can retain 97% of the performance of the InteractiveOmni-8B while utilizing only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.","short_abstract":"We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we i...","url_abs":"https://arxiv.org/abs/2510.13747","url_pdf":"https://arxiv.org/pdf/2510.13747v2","authors":"[\"Wenwen Tong\",\"Hewei Guo\",\"Dongchuan Ran\",\"Jiangnan Chen\",\"Jiefan Lu\",\"Kaibin Wang\",\"Keqiang Li\",\"Xiaoxu Zhu\",\"Jiakui Li\",\"Kehan Li\",\"Xueheng Li\",\"Lumin Li\",\"Chenxu Guo\",\"Jiasheng Zhou\",\"Jiandong Chen\",\"Xianye Wu\",\"Jiahao Wang\",\"Silei Wu\",\"Lei Chen\",\"Hanming Deng\",\"Yuxuan Song\",\"Dinghao Zhou\",\"Guiping Zhong\",\"Ken Zheng\",\"Shiyin Kang\",\"Lewei Lu\"]","published":"2025-10-15T16:52:48Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}
