{"ID":2828328,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14052","arxiv_id":"2512.14052","title":"HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices","abstract":"Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.","short_abstract":"Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. 
While small-parameter models are progressively endowed with strong general capabilities, standard Vision Tran...","url_abs":"https://arxiv.org/abs/2512.14052","url_pdf":"https://arxiv.org/pdf/2512.14052v1","authors":"[\"HyperAI Team\",\"Yuchen Liu\",\"Kaiyang Han\",\"Zhiqiang Xia\",\"Yuhang Dong\",\"Chen Song\",\"Kangyu Tang\",\"Jiaming Xu\",\"Xiushi Feng\",\"WenXuan Yu\",\"Li Peng\",\"Mingyang Wang\",\"Kai Wang\",\"Changpeng Yang\",\"Yang Li\",\"Haoyu Lu\",\"Hao Wang\",\"Bingna Xu\",\"Guangyao Liu\",\"Long Huang\",\"Kaibin Guo\",\"Jinyang Wu\",\"Dan Wu\",\"Hongzhen Wang\",\"Peng Zhou\",\"Shuai Nie\",\"Shande Wang\",\"Runyu Shi\",\"Ying Huang\"]","published":"2025-12-16T03:36:41Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Vision Transformer\",\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false}