{"ID":2828672,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.19910","arxiv_id":"2601.19910","title":"Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading","abstract":"KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks. In this paper, we develop an analytical framework that derives $κ_{\\text{crit}}$, the critical cached-to-prefill token ratio where execution becomes memory-bound and show typical workloads exceed this threshold by orders of magnitude. Empirical characterization reveals 99\\% of latency is spent on transfers, and serving offloaded requests results in GPUs consuming only 28\\% of their rated TDP, motivating our proposed optimizations for hardware interconnects, model architectures, and scheduling algorithms.","short_abstract":"KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks. In this paper, we develop an analytical framework that derives $κ_{\\text{crit}}$, the critical cached-to-prefill token ratio where execution becomes memory-bound and show typi...","url_abs":"https://arxiv.org/abs/2601.19910","url_pdf":"https://arxiv.org/pdf/2601.19910v1","authors":"[\"William Meng\",\"Benjamin Lee\",\"Hong Wang\"]","published":"2025-12-16T19:29:13Z","proceeding":"cs.AR","tasks":"[\"cs.AR\",\"cs.DC\"]","methods":"[\"Large Language Model\"]","has_code":false}
