{"ID":2886953,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.02446","arxiv_id":"2508.02446","title":"TeraNoC: A Multi-Channel 32-bit Fine-Grained, Hybrid Mesh-Crossbar NoC for Efficient Scale-up of 1000+ Core Shared-L1-Memory Clusters","abstract":"A key challenge in on-chip interconnect design is to scale up bandwidth while maintaining low latency and high area efficiency. 2D-meshes scale with low wiring area and congestion overhead; however, their end-to-end latency increases with the number of hops, making them unsuitable for latency-sensitive core-to-L1-memory access. On the other hand, crossbars offer low latency, but their routing complexity grows quadratically with the number of I/Os, requiring large physical routing resources and limiting area-efficient scalability. This two-sided interconnect bottleneck hinders the scale-up of many-core, low-latency, tightly coupled shared-memory clusters, pushing designers toward instantiating many smaller and loosely coupled clusters, at the cost of hardware and software overheads. We present TeraNoC, an open-source, hybrid mesh-crossbar on-chip interconnect that offers both scalability and low latency, while maintaining very low routing overhead. The topology, built on 32bit word-width multi-channel 2D-meshes and crossbars, enables the area-efficient scale-up of shared-memory clusters. A router remapper is designed to balance traffic load across interconnect channels. Using TeraNoC, we build a cluster with 1024 single-stage, single-issue cores that share a 4096-banked L1 memory, implemented in 12nm technology. The low interconnect stalls enable high compute utilization of up to 0.85 IPC in compute-intensive, data-parallel key GenAI kernels. TeraNoC only consumes 7.6\\% of the total cluster power in kernels dominated by crossbar accesses, and 22.7\\% in kernels with high 2D-mesh traffic. Compared to a hierarchical crossbar-only cluster, TeraNoC reduces die area by 37.8\\% and improves area efficiency (GFLOP/s/mm2) by up to 98.7\\%, while occupying only 10.9\\% of the logic area.","short_abstract":"A key challenge in on-chip interconnect design is to scale up bandwidth while maintaining low latency and high area efficiency. 2D-meshes scale with low wiring area and congestion overhead; however, their end-to-end latency increases with the number of hops, making them unsuitable for latency-sensitive core-to-L1-memor...","url_abs":"https://arxiv.org/abs/2508.02446","url_pdf":"https://arxiv.org/pdf/2508.02446v1","authors":"[\"Yichao Zhang\",\"Zexin Fu\",\"Tim Fischer\",\"Yinrong Li\",\"Marco Bertuletti\",\"Luca Benini\"]","published":"2025-08-04T14:08:28Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[]","has_code":false}