{"ID":2827328,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.16056","arxiv_id":"2512.16056","title":"MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services","abstract":"Host-GPU data movement has become a latency-critical bottleneck in LLM serving, surfacing in common paths such as model-weight movement and KV cache offload/fetch. Today, each host-GPU copy is effectively confined to the PCIe path of the target GPU, even though modern multi-GPU servers contain additional PCIe links on peer GPUs and high-bandwidth GPU interconnects. This leaves substantial intra-server I/O capacity unused. To address this issue, we present Multipath Memory Access (MMA), a software-defined multipath memory access system for host-GPU data transfer. To the best of our knowledge, MMA is the first software-defined system to enable efficient multipath host-GPU data transfer within a single multi-GPU server. MMA expands a single host-GPU copy across available direct and relay paths without hardware, driver, or application changes. It preserves CUDA stream semantics with a dependency-preserving Dummy Task, coordinates distributed micro-transfer completion through a lightweight synchronization mechanism, and uses queue backpressure to route traffic without explicit link-state feedback. On an 8-GPU NVIDIA H20 server, MMA achieves 245 GB/s peak host-to-GPU bandwidth, a 4.62x improvement over native CUDA copies, and reduces TTFT for KV cache fetching by 1.14-2.38x and model wake-up/switching latency by 1.12-2.48x.","short_abstract":"Host-GPU data movement has become a latency-critical bottleneck in LLM serving, surfacing in common paths such as model-weight movement and KV cache offload/fetch. Today, each host-GPU copy is effectively confined to the PCIe path of the target GPU, even though modern multi-GPU servers contain additional PCIe links on...","url_abs":"https://arxiv.org/abs/2512.16056","url_pdf":"https://arxiv.org/pdf/2512.16056v2","authors":"[\"Lingfeng Tang\",\"Daoping Zhang\",\"Junjie Chen\",\"Peihao Huang\",\"Feng Jin\",\"Chengguang Xu\",\"Yuxin Chen\",\"Feiqiang Sun\",\"Guo Chen\"]","published":"2025-12-18T00:45:00Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.NI\",\"cs.PF\"]","methods":"[\"Large Language Model\"]","has_code":false}
