{"ID":2865909,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20979","arxiv_id":"2509.20979","title":"Toward Robust and Efficient ML-Based GPU Caching for Modern Inference","abstract":"In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as \\textsc{LRU} can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through predictor design, but often follow learned predictions blindly, making performance unreliable when predictions are inaccurate. In contrast, emerging learning-augmented caching algorithms~\\cite{pmlr-v80-lykouris18a,mitzenmacher2022algorithms} provide performance guarantees by carefully integrating predictions into caching policies, achieving both \\emph{consistency} (near-optimality under perfect predictions) and \\emph{robustness} (bounded worst-case performance under prediction errors). However, deployment remains challenging. A practical algorithm should satisfy strict time and space efficiency constraints, which some theoretical work overlooks, while also incurring low deployment overhead. We propose learning-augmented LRU, a deployment-oriented learning-augmented caching algorithm that guarantees \\emph{1-consistency} and \\emph{$O(k)$-robustness}, incurs low time and space overhead, and maintains strong compatibility. We further build a GPU cache, called \\textsc{LCR}, on top of learning-augmented LRU to benefit from its theoretical guarantees and translate them into practical performance. In experiments, \\textsc{LCR} reduces P99 time-to-first-token (TTFT) by up to 28.3\\% on LLM workloads and increases throughput by up to 24.2\\% on deep learning recommendation (DLRM) workloads. Even with poor predictions, performance degrades gracefully and remains close to \\textsc{LRU}, demonstrating robustness with practical value.","short_abstract":"In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as \\textsc{LRU} can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through predictor design, but often follow learned predictions blindly, making performance un...","url_abs":"https://arxiv.org/abs/2509.20979","url_pdf":"https://arxiv.org/pdf/2509.20979v2","authors":"[\"Peng Chen\",\"Jiaji Zhang\",\"Hailiang Zhao\",\"Yirong Zhang\",\"Shenyao Chen\",\"Jiahong Yu\",\"Xueyan Tang\",\"Yixuan Wang\",\"Hao Li\",\"Jianping Zou\",\"Gang Xiong\",\"Kingsum Chow\",\"Shuibing He\",\"Shuiguang Deng\"]","published":"2025-09-25T10:23:50Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}