{"ID":2843674,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.06605","arxiv_id":"2511.06605","title":"DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication","abstract":"Offloading communication to existing direct memory access (DMA) engines, available on most state-of-the-art commercial GPUs, has emerged as an interesting and low-cost solution to efficiently overlap computation and communication in machine learning (ML). That said, so far, the reach of DMA offloads has been limited to bandwidth-bound scenarios only (10s of MB to GB transfer sizes). In this work, we aim to break this barrier and expand the reach of DMA communication offloads to even latency-bound regions (KB to low MB). Specifically, we discuss in this work hitherto untapped features available in the state-of-the-art AMD Instinct$^{\\mathrm{TM}}$ MI300X GPUs that render DMA communication offloads competitive even for latency-bound regions. We demonstrate the efficacy of these features at the operator-level (ML communication collectives such as all-gather and all-to-all), and also at the end-to-end workload-level (LLM inference). For the former, our optimized DMA offloads close up to 4.5$\\times$ performance gap and deliver additional power savings (3-10%) for ML collectives as compared to state-of-the-art GPU core-based communication library, RCCL. For the latter, we demonstrate acceleration for LLM inference: up to 1.5$\\times$ lower latency and up to 1.9$\\times$ higher throughput over the state-of-the-art vLLM inference framework. We conclude with a discussion of AMD Instinct GPU runtime innovations that stand to expose these features and additionally identify future hardware-software co-design potential to further improve DMA offload efficiency.","short_abstract":"Offloading communication to existing direct memory access (DMA) engines, available on most state-of-the-art commercial GPUs, has emerged as an interesting and low-cost solution to efficiently overlap computation and communication in machine learning (ML). That said, so far, the reach of DMA offloads has been limited to...","url_abs":"https://arxiv.org/abs/2511.06605","url_pdf":"https://arxiv.org/pdf/2511.06605v2","authors":"[\"Suchita Pati\",\"Shaizeen Aga\",\"Mahzabeen Islam\",\"Ryan Quach\",\"Saleel Kudchadker\",\"Mohamed Assem Ibrahim\"]","published":"2025-11-10T01:28:58Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AR\"]","methods":"[\"Large Language Model\"]","has_code":false}