{"ID":2877797,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20253","arxiv_id":"2508.20253","title":"SpeedMalloc: Improving Multi-threaded Applications via a Lightweight Core for Memory Allocation","abstract":"Memory allocation, though constituting only a small portion of the executed code, can have a \"butterfly effect\" on overall program performance, leading to significant and far-reaching impacts. Despite accounting for just approximately 5% of total instructions, memory allocation can result in up to a 2.7x performance variation depending on the allocator used. This effect arises from the complexity of memory allocation in modern multi-threaded multi-core systems, where allocator metadata becomes intertwined with user data, leading to cache pollution or increased cross-thread synchronization overhead. Offloading memory allocators to accelerators, e.g., Mallacc and Memento, is a potential direction to improve the allocator performance and mitigate cache pollution. However, these accelerators currently have limited support for multi-threaded applications, and synchronization between cores and accelerators remains a significant challenge. We present SpeedMalloc, using a lightweight support-core to process memory allocation tasks in multi-threaded applications. The support-core is a lightweight programmable processor with efficient cross-core data synchronization and houses all allocator metadata in its own caches. This design minimizes cache conflicts with user data and eliminates the need for cross-core metadata synchronization. In addition, using a general-purpose core instead of domain-specific accelerators makes SpeedMalloc capable of adopting new allocator designs. We compare SpeedMalloc with state-of-the-art software and hardware allocators, including Jemalloc, TCMalloc, Mimalloc, Mallacc, and Memento. SpeedMalloc achieves 1.75x, 1.18x, 1.15x, 1.23x, and 1.18x speedups on multithreaded workloads over these five allocators, respectively.","short_abstract":"Memory allocation, though constituting only a small portion of the executed code, can have a \"butterfly effect\" on overall program performance, leading to significant and far-reaching impacts. Despite accounting for just approximately 5% of total instructions, memory allocation can result in up to a 2.7x performance va...","url_abs":"https://arxiv.org/abs/2508.20253","url_pdf":"https://arxiv.org/pdf/2508.20253v1","authors":"[\"Ruihao Li\",\"Qinzhe Wu\",\"Krishna Kavi\",\"Gayatri Mehta\",\"Jonathan C. Beard\",\"Neeraja J. Yadwadkar\",\"Lizy K. John\"]","published":"2025-08-27T20:18:37Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AR\"]","methods":"[]","has_code":false}
