{"ID":2840834,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.13940","arxiv_id":"2511.13940","title":"ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels","abstract":"Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance$\\unicode{x2014}$data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to $2.33 \\times$ speedup for data- and tensor-parallel workloads, $4.08 \\times$ for sequence-parallel workloads, and $1.22 \\times$ for expert-parallel workloads.","short_abstract":"Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across he...","url_abs":"https://arxiv.org/abs/2511.13940","url_pdf":"https://arxiv.org/pdf/2511.13940v1","authors":"[\"Stuart H. Sul\",\"Simran Arora\",\"Benjamin F. Spector\",\"Christopher Ré\"]","published":"2025-11-17T21:48:33Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.LG\"]","methods":"[]","has_code":false}