{"ID":3083893,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T03:54:17.966829144Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05875","arxiv_id":"2606.05875","title":"QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving","abstract":"Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.","short_abstract":"Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks an...","url_abs":"https://arxiv.org/abs/2606.05875","url_pdf":"https://arxiv.org/pdf/2606.05875v1","authors":"[\"Jianxin Yan\",\"Wangze Ni\",\"Zhenxin Li\",\"Jiabao Jin\",\"Zhitao Shen\",\"Haoyang Li\",\"Jia Zhu\",\"Peng Cheng\",\"Xuemin Lin\",\"Lei Chen\",\"Kui Ren\"]","published":"2026-06-04T08:47:46Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.DB\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\"]","has_code":false}