{"ID":2892597,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03674","arxiv_id":"2508.03674","title":"Morphlux: Transforming Torus Fabrics for Efficient Multi-tenant ML","abstract":"We develop Morphlux, a server-scale programmable photonic fabric to interconnect accelerators within servers. We show that augmenting state-of-the-art torus-based ML data-centers with Morphlux can improve the bandwidth of tenant compute allocations by up to 66%, reduce compute fragmentation by up to 70%, and minimize the blast radius of chip failures. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these performance benefits which translate to 1.72X improvement in training throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can replace a failed accelerator chip with a healthy one in 1.2 seconds.","short_abstract":"We develop Morphlux, a server-scale programmable photonic fabric to interconnect accelerators within servers. We show that augmenting state-of-the-art torus-based ML data-centers with Morphlux can improve the bandwidth of tenant compute allocations by up to 66%, reduce compute fragmentation by up to 70%, and minimize t...","url_abs":"https://arxiv.org/abs/2508.03674","url_pdf":"https://arxiv.org/pdf/2508.03674v3","authors":"[\"Abhishek Vijaya Kumar\",\"Eric Ding\",\"Arjun Devraj\",\"Darius Bunandar\",\"Rachee Singh\"]","published":"2025-07-20T12:40:21Z","proceeding":"cs.NI","tasks":"[\"cs.NI\",\"cs.AR\",\"cs.LG\"]","methods":"[]","has_code":false}