{"ID":2845805,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.03475","arxiv_id":"2511.03475","title":"ContextPilot: Fast Long-Context Inference via Context Reuse","abstract":"AI applications increasingly depend on long-context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval-augmented generation, agent memory layers, and multi-agent orchestration. As input contexts get longer, prefill latency becomes the main bottleneck. Yet today's prefill acceleration techniques face a trade-off: they either preserve reasoning quality but deliver little KV-cache reuse, or improve reuse at the cost of degraded reasoning quality. We present ContextPilot, a system that accelerates prefill by introducing context reuse as a new mechanism for faster long-context inference. ContextPilot introduces a context index to identify overlapping context blocks across LLM interactions (e.g., across users and turns). It further proposes context ordering and de-duplication techniques to maximize KV-cache reuse. To preserve reasoning quality under reuse, it introduces succinct context annotations that prevent quality degradation. Finally, ContextPilot is built around a modular architecture with a clean interface that integrates with existing inference engines. Extensive evaluation shows that ContextPilot reduces LLM prefill latency by up to $3\\times{}$ compared to state-of-the-art methods while preserving reasoning quality. At longer context lengths, it can even improve reasoning quality. ContextPilot is open-sourced at: https://github.com/EfficientContext/ContextPilot.","short_abstract":"AI applications increasingly depend on long-context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval-augmented generation, agent memory layers, and multi-agent orchestration. As input contexts get longer, prefill latency becomes the main bottleneck. Yet...","url_abs":"https://arxiv.org/abs/2511.03475","url_pdf":"https://arxiv.org/pdf/2511.03475v4","authors":"[\"Yinsicheng Jiang\",\"Yeqi Huang\",\"Liang Cheng\",\"Cheng Deng\",\"Xuan Sun\",\"Luo Mai\"]","published":"2025-11-05T13:59:01Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"RAG\",\"Large Language Model\"]","has_code":false,"code_links":[{"ID":607385,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2845805,"paper_url":"https://arxiv.org/abs/2511.03475","paper_title":"ContextPilot: Fast Long-Context Inference via Context Reuse","repo_url":"https://github.com/EfficientContext/ContextPilot","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
