{"ID":2827032,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.17436","arxiv_id":"2512.17436","title":"Xiaomi MiMo-VL-Miloco Technical Report","abstract":"We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.","short_abstract":"We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environ...","url_abs":"https://arxiv.org/abs/2512.17436","url_pdf":"https://arxiv.org/pdf/2512.17436v2","authors":"[\"Jiaze Li\",\"Jingyang Chen\",\"Yuxun Qu\",\"Shijie Xu\",\"Zhenru Lin\",\"Junyou Zhu\",\"Boshen Xu\",\"Wenhui Tan\",\"Pei Fu\",\"Jianzhong Ju\",\"Zhenbo Luo\",\"Jian Luan\"]","published":"2025-12-19T10:43:37Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false,"code_links":[{"ID":605779,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2827032,"paper_url":"https://arxiv.org/abs/2512.17436","paper_title":"Xiaomi MiMo-VL-Miloco Technical Report","repo_url":"https://github.com/XiaoMi/xiaomi-mimo-vl-miloco","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}