{"ID":2855246,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13602","arxiv_id":"2510.13602","title":"NOSA: Native and Offloadable Sparse Attention","abstract":"Decoding throughput improvements from larger inference batches are limited by GPU memory, which is largely consumed by the key-value (KV) cache. Prior training-free KV cache offloading alleviates this by keeping redundant context on the CPU and fetching only a sparse subset for attention, but it often degrades long-generation quality due to training-inference mismatch on sparse patterns. Meanwhile, trainable sparse attention is incompatible with efficient offloading, as unconstrained KV accesses may force large CPU-to-GPU transfers and erase throughput gains. To this end, we propose NOSA, a trainable sparse attention mechanism natively designed for KV cache offloading. NOSA explicitly constrains the volume of CPU-GPU KV transfers, thereby achieving low communication overhead and high decoding throughput. We further build NOSI, a KV cache offloading inference system that fully unlocks NOSA's efficiency. Empirical results on 1,3,8B LLMs demonstrate that NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks, while boosting decoding throughput by up to 5.04x, 1.92x, and 1.83x over FullAttn, InfLLMv2, and ShadowKV, respectively. We release our code at https://github.com/thunlp/NOSA.","short_abstract":"Decoding throughput improvements from larger inference batches are limited by GPU memory, which is largely consumed by the key-value (KV) cache. Prior training-free KV cache offloading alleviates this by keeping redundant context on the CPU and fetching only a sparse subset for attention, but it often degrades long-gen...","url_abs":"https://arxiv.org/abs/2510.13602","url_pdf":"https://arxiv.org/pdf/2510.13602v2","authors":"[\"Yuxiang Huang\",\"Pengjie Wang\",\"Jicheng Han\",\"Weilin Zhao\",\"Zhou Su\",\"Ao Sun\",\"Hongya Lyu\",\"Hengyu Zhao\",\"Yudong Wang\",\"Chaojun Xiao\",\"Xu Han\",\"Zhiyuan Liu\"]","published":"2025-10-15T14:33:16Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":608233,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2855246,"paper_url":"https://arxiv.org/abs/2510.13602","paper_title":"NOSA: Native and Offloadable Sparse Attention","repo_url":"https://github.com/thunlp/NOSA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
