{"ID":2853481,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.16606","arxiv_id":"2510.16606","title":"Reimagining RDMA Through the Lens of ML","abstract":"As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed inter-connects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, like RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms introduce complexity and latency that scale poorly, where even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML's tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3x, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults -- delivering a resilient, scalable transport tailored for ML at cluster scale.","short_abstract":"As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed inter-connects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, like RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmission...","url_abs":"https://arxiv.org/abs/2510.16606","url_pdf":"https://arxiv.org/pdf/2510.16606v1","authors":"[\"Ertza Warraich\",\"Ali Imran\",\"Annus Zulfiqar\",\"Shay Vargaftik\",\"Sonia Fahmy\",\"Muhammad Shahbaz\"]","published":"2025-10-18T18:08:30Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.NI\"]","methods":"[]","has_code":false}
