{"ID":2886806,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.02215","arxiv_id":"2508.02215","title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","abstract":"Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns channel-wise static mask that could satisfy specific sparsity ratio and hardware alignment requirement. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. Custom decoding kernel enables 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.","short_abstract":"Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns channel-wis...","url_abs":"https://arxiv.org/abs/2508.02215","url_pdf":"https://arxiv.org/pdf/2508.02215v1","authors":"[\"Yike Zhang\",\"Zhiyuan He\",\"Huiqiang Jiang\",\"Chengruidong Zhang\",\"Yuqing Yang\",\"Jianyong Wang\",\"Lili Qiu\"]","published":"2025-08-04T09:08:43Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","project_urls":"[\"https://aka.ms/LeanK\"]","has_code":false,"code_links":[{"ID":611351,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/microsoft/MInference","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611352,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/microsoft/MInference.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611353,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/features/copilot","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611354,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/features/spark","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611355,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/features/models","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611356,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/features/actions","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611357,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/features/codespaces","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611358,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/features/issues","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611359,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/features/code-review","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611360,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/security/advanced-security","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611361,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/enterprise/startups","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611362,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/solutions/industry","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":611363,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886806,"paper_url":"https://arxiv.org/abs/2508.02215","paper_title":"LeanK: Learnable K Cache Channel Pruning for Efficient Decoding","repo_url":"https://github.com/solutions/use-case","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}