{"ID":2834099,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02924","arxiv_id":"2512.02924","title":"AutoNeural: Co-Designing Vision-Language Models for NPU Inference","abstract":"While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision--Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x the decoding speed and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. 
Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.","short_abstract":"While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision--Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O...","url_abs":"https://arxiv.org/abs/2512.02924","url_pdf":"https://arxiv.org/pdf/2512.02924v2","authors":"[\"Wei Chen\",\"Liangmin Wu\",\"Yunhai Hu\",\"Zhiyuan Li\",\"Zhiyuan Cheng\",\"Yicheng Qian\",\"Lingyue Zhu\",\"Zhipeng Hu\",\"Luoyi Liang\",\"Qiang Tang\",\"Zhen Liu\",\"Han Yang\"]","published":"2025-12-02T16:45:25Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Vision Transformer\",\"Transformer\",\"Language Model\"]","has_code":false}