{"ID":2830321,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.11920","arxiv_id":"2512.11920","title":"CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving","abstract":"Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose \\textbf{CXL-SpecKV}, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries, and (iii) an FPGA-accelerated KV-cache compression and decompression engine that reduces memory bandwidth requirements by up to 4$\\times$. When evaluated on state-of-the-art LLM models, CXL-SpecKV achieves up to 3.2$\\times$ higher throughput compared to GPU-only baselines, while reducing memory costs by 2.8$\\times$ and maintaining accuracy. Our system demonstrates that intelligent memory disaggregation combined with speculative execution can effectively address the memory wall challenge in large-scale LLM serving. Our code implementation has been open-sourced at https://github.com/FastLM/CXL-SpecKV.","short_abstract":"Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting...","url_abs":"https://arxiv.org/abs/2512.11920","url_pdf":"https://arxiv.org/pdf/2512.11920v1","authors":"[\"Dong Liu\",\"Yanxuan Yu\"]","published":"2025-12-11T15:40:36Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606025,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2830321,"paper_url":"https://arxiv.org/abs/2512.11920","paper_title":"CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving","repo_url":"https://github.com/FastLM/CXL-SpecKV","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
