{"ID":2868540,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15515","arxiv_id":"2509.15515","title":"LLM Cache Bandit Revisited: Addressing Query Heterogeneity for Cost-Effective LLM Inference","abstract":"This paper revisits the LLM cache bandit problem, with a special focus on addressing the query heterogeneity for cost-effective LLM inference. Previous works often assume uniform query sizes. Heterogeneous query sizes introduce a combinatorial structure for cache selection, making the cache replacement process more computationally and statistically challenging. We treat optimal cache selection as a knapsack problem and employ an accumulation-based strategy to effectively balance computational overhead and cache updates. In theoretical analysis, we prove that the regret of our algorithm achieves an $O(\\sqrt{MNT})$ bound, improving the coefficient of $\\sqrt{MN}$ compared to the $O(MN\\sqrt{T})$ result in Berkeley, where $N$ is the total number of queries and $M$ is the cache size. Additionally, we also provide a problem-dependent bound, which was absent in previous works. The experiment rely on real-world data show that our algorithm reduces the total cost by approximately 12\\%.","short_abstract":"This paper revisits the LLM cache bandit problem, with a special focus on addressing the query heterogeneity for cost-effective LLM inference. Previous works often assume uniform query sizes. Heterogeneous query sizes introduce a combinatorial structure for cache selection, making the cache replacement process more com...","url_abs":"https://arxiv.org/abs/2509.15515","url_pdf":"https://arxiv.org/pdf/2509.15515v1","authors":"[\"Hantao Yang\",\"Hong Xie\",\"Defu Lian\",\"Enhong Chen\"]","published":"2025-09-19T01:39:08Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}