{"ID":2842970,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.09557","arxiv_id":"2511.09557","title":"Understanding and Improving Communication Performance in Multi-node LLM Inference","abstract":"As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9$\\times$-3.6$\\times$ lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72$\\times$ reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.","short_abstract":"As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference...","url_abs":"https://arxiv.org/abs/2511.09557","url_pdf":"https://arxiv.org/pdf/2511.09557v4","authors":"[\"Prajwal Singhania\",\"Siddharth Singh\",\"Lannie Dalton Hough\",\"Akarsh Srivastava\",\"Harshitha Menon\",\"Charles Fredrick Jekel\",\"Abhinav Bhatele\"]","published":"2025-11-12T18:59:26Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
