{"ID":2832266,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06443","arxiv_id":"2512.06443","title":"Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices","abstract":"Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens, and performs a single $1 \\rightarrow N$ lookup per index. To realize it efficiently, we further introduce (1) Vector LUT-Centric Tensor Layout, and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\\times$. Our implementation is integrated into llama.cpp. The code is available at https://github.com/OpenBitSys/vlut.cpp.","short_abstract":"Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening...","url_abs":"https://arxiv.org/abs/2512.06443","url_pdf":"https://arxiv.org/pdf/2512.06443v2","authors":"[\"Xiangyu Li\",\"Chengyu Yin\",\"Weijun Wang\",\"Jianyu Wei\",\"Ting Cao\",\"Yunxin Liu\"]","published":"2025-12-06T14:14:01Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606218,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832266,"paper_url":"https://arxiv.org/abs/2512.06443","paper_title":"Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices","repo_url":"https://github.com/OpenBitSys/vlut.cpp","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
