{"ID":2850283,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.23649","arxiv_id":"2510.23649","title":"Efficient Low Rank Attention for Long-Context Inference in Large Language Models","abstract":"As the length of input text increases, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. In this work, Low Rank Query and Key attention (LRQK) is introduced, a two-stage framework that jointly decomposes full-precision query and key matrices into compact rank-\\(r\\) factors during the prefill stage, and then employs these low-dimensional projections to compute proxy attention scores in \\(\\mathcal{O}(lr)\\) time at each decode step. By selecting only the top-\\(k\\) tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism where only missing full-precision KV pairs are transferred, thereby preserving exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long context settings, while delivering significant memory savings with minimal accuracy loss. Our code is available at https://github.com/tenghuilee/LRQK.","short_abstract":"As the length of input text increases, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource constrained devices. \nExisting approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention o...","url_abs":"https://arxiv.org/abs/2510.23649","url_pdf":"https://arxiv.org/pdf/2510.23649v3","authors":"[\"Tenghui Li\",\"Guoxu Zhou\",\"Xuyang Zhao\",\"Yuning Qiu\",\"Qibin Zhao\"]","published":"2025-10-25T11:43:27Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607787,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2850283,"paper_url":"https://arxiv.org/abs/2510.23649","paper_title":"Efficient Low Rank Attention for Long-Context Inference in Large Language Models","repo_url":"https://github.com/tenghuilee/LRQK","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
