{"ID":2921087,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-04T07:41:34.29888543Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01852","arxiv_id":"2606.01852","title":"Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs","abstract":"Exact tensor network contraction underpins quantum circuit simulation, quantum error correction, combinatorial optimization, and many-body dynamics. The dominant parallelization strategy, slicing, scales exponentially and incurs redundant computation. We present a multi-GPU framework that instead distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. Within a single DGX H100 node (8 GPUs, NVLink), distribution delivers $7$--$173\\times$ extra speedup beyond embarrassingly parallel slicing, capturing nearly all of the available compute reduction (87--101%) because NVLink's high bandwidth keeps communication small relative to compute. Scaling the same four workloads to 1024 H100 GPUs over InfiniBand, the extra speedup beyond slicing ranges from $42\\times$ to $67{,}869\\times$, demonstrating that communication-aware distributed contraction far surpasses slicing-based scaling limits for frontier tensor networks.","short_abstract":"Exact tensor network contraction underpins quantum circuit simulation, quantum error correction, combinatorial optimization, and many-body dynamics. The dominant parallelization strategy, slicing, scales exponentially and incurs redundant computation. We present a multi-GPU framework that instead distributes intermedia...","url_abs":"https://arxiv.org/abs/2606.01852","url_pdf":"https://arxiv.org/pdf/2606.01852v1","authors":"[\"Feng Pan\",\"Hanfeng Gu\",\"Paul Springer\",\"Xipeng Li\"]","published":"2026-06-01T08:02:29Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"quant-ph\"]","methods":"[]","has_code":false}