{"ID":2835920,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.22333","arxiv_id":"2511.22333","title":"PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel","abstract":"LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: one-query-per-CTA execution repeatedly loads shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention. This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses, runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge performs online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 53.5% on average and TPOT by 17.0-93.1% under the same configurations against state-of-the-art attention kernels. PAT's source code is publicly available at https://github.com/flashserve/PAT.","short_abstract":"LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. \nMeanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementati...","url_abs":"https://arxiv.org/abs/2511.22333","url_pdf":"https://arxiv.org/pdf/2511.22333v3","authors":"[\"Jinjun Yi\",\"Zhixin Zhao\",\"Yitao Hu\",\"Ke Yan\",\"Weiwei Sun\",\"Hao Wang\",\"Laiping Zhao\",\"Yuhao Zhang\",\"Wenxin Li\",\"Keqiu Li\"]","published":"2025-11-27T11:10:30Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":606553,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2835920,"paper_url":"https://arxiv.org/abs/2511.22333","paper_title":"PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel","repo_url":"https://github.com/flashserve/PAT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
