{"ID":2868007,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16857","arxiv_id":"2509.16857","title":"ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching","abstract":"Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with model computation. We present ShadowServe, the first SmartNIC-accelerated, interference-free prefix caching system for LLM serving. ShadowServe separates a control plane on the host and a data plane fully offloaded to the SmartNIC, which eliminates interference to both host GPU and CPU. To overcome the SmartNIC's limited compute and memory resources, we design a chunked pipeline that parallelizes data plane operations across the SmartNIC's compute resources, and a minimal-copy memory management scheme that reduces memory pressure on the SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2x lower loaded time-per-output-token (TPOT), and reduces time-to-first-token (TTFT) by up to 1.38x in low-bandwidth scenarios (\u003c= 20 Gbps), translating to up to 1.35x higher throughput.","short_abstract":"Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with...","url_abs":"https://arxiv.org/abs/2509.16857","url_pdf":"https://arxiv.org/pdf/2509.16857v1","authors":"[\"Xingyu Xiang\",\"Raj Joshi\",\"Yuhan Liu\",\"Jiayi Yao\",\"Chenxingyu Zhao\",\"Junchen Jiang\",\"Yang Zhou\",\"Eddie Kohler\",\"Minlan Yu\"]","published":"2025-09-21T00:59:45Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}
