{"ID":2854259,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15870","arxiv_id":"2510.15870","title":"OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM","abstract":"Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.","short_abstract":"Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model archit...","url_abs":"https://arxiv.org/abs/2510.15870","url_pdf":"https://arxiv.org/pdf/2510.15870v2","authors":"[\"Hanrong Ye\",\"Chao-Han Huck Yang\",\"Arushi Goel\",\"Wei Huang\",\"Ligeng Zhu\",\"Yuanhang Su\",\"Sean Lin\",\"An-Chieh Cheng\",\"Zhen Wan\",\"Jinchuan Tian\",\"Yuming Lou\",\"Dong Yang\",\"Zhijian Liu\",\"Yukang Chen\",\"Ambrish Dantrey\",\"Ehsan Jahangiri\",\"Sreyan Ghosh\",\"Daguang Xu\",\"Ehsan Hosseini-Asl\",\"Danial Mohseni Taheri\",\"Vidya Murali\",\"Sifei Liu\",\"Yao Lu\",\"Oluwatobi Olabiyi\",\"Yu-Chiang Frank Wang\",\"Rafael Valle\",\"Bryan Catanzaro\",\"Andrew Tao\",\"Song Han\",\"Jan Kautz\",\"Hongxu Yin\",\"Pavlo Molchanov\"]","published":"2025-10-17T17:59:59Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Graph Neural Network\"]","has_code":false}