{"ID":2844647,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.06010","arxiv_id":"2511.06010","title":"MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference","abstract":"The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.","short_abstract":"The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by expl...","url_abs":"https://arxiv.org/abs/2511.06010","url_pdf":"https://arxiv.org/pdf/2511.06010v1","authors":"[\"Myunghyun Rhee\",\"Sookyung Choi\",\"Euiseok Kim\",\"Joonseop Sim\",\"Youngpyo Joo\",\"Hoshik Kim\"]","published":"2025-11-08T13:40:16Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.DC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
