{"ID":2877802,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20258","arxiv_id":"2508.20258","title":"SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization","abstract":"Large language models (LLMs) have shown progress in GPU kernel performance engineering using inefficient search-based methods that optimize around runtime. Any existing approach lacks a key characteristic that human performance engineers rely on for near-optimal utilization -- hardware-awareness. By leveraging the workload's specific memory access patterns, architecture specifications, filtered profiling logs, and reflections on historical performance, we can make software-level optimizations that are tailored to the underlying hardware. SwizzlePerf automatically generates spatial optimizations for GPU kernels on disaggregated architectures by giving LLMs explicit hardware-awareness. For a GEMM kernel, SwizzlePerf takes less than 5 minutes to generate the same hardware-specific optimal swizzling pattern that took expert performance engineers 2 weeks to find. On a suite of 10 diverse ML and Science kernels, SwizzlePerf can generate swizzling patterns for 9 of the kernels that achieve up to a 2.06x speedup and 70% improvement in L2 hit rate. This work is the first of many steps toward systematically creating hardware-aware LLM performance engineering agents.","short_abstract":"Large language models (LLMs) have shown progress in GPU kernel performance engineering using inefficient search-based methods that optimize around runtime. Any existing approach lacks a key characteristic that human performance engineers rely on for near-optimal utilization -- hardware-awareness. \nBy leveraging the work...","url_abs":"https://arxiv.org/abs/2508.20258","url_pdf":"https://arxiv.org/pdf/2508.20258v1","authors":"[\"Arya Tschand\",\"Muhammad Awad\",\"Ryan Swann\",\"Kesavan Ramakrishnan\",\"Jeffrey Ma\",\"Keith Lowery\",\"Ganesh Dasika\",\"Vijay Janapa Reddi\"]","published":"2025-08-27T20:30:43Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}