{"ID":2865804,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20819","arxiv_id":"2509.20819","title":"Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads","abstract":"Scientific workflows increasingly involve both HPC and machine-learning tasks, combining MPI-based simulations, training, and inference in a single execution. Launchers such as Slurm's srun constrain concurrency and throughput, making them unsuitable for dynamic and heterogeneous workloads. We present a performance study of RADICAL-Pilot (RP) integrated with Flux and Dragon, two complementary runtime systems that enable hierarchical resource management and high-throughput function execution. Using synthetic and production-scale workloads on Frontier, we characterize the task execution properties of RP across runtime configurations. RP+Flux sustains up to 930 tasks/s, and RP+Flux+Dragon exceeds 1,500 tasks/s with over 99.6% utilization. In contrast, srun peaks at 152 tasks/s and degrades with scale, with utilization below 50%. For IMPECCABLE.v2 drug discovery campaign, RP+Flux reduces makespan by 30-60% relative to srun/Slurm and increases throughput more than four times on up to 1,024. These results demonstrate hybrid runtime integration in RP as a scalable approach for hybrid AI-HPC workloads.","short_abstract":"Scientific workflows increasingly involve both HPC and machine-learning tasks, combining MPI-based simulations, training, and inference in a single execution. Launchers such as Slurm's srun constrain concurrency and throughput, making them unsuitable for dynamic and heterogeneous workloads. We present a performance stu...","url_abs":"https://arxiv.org/abs/2509.20819","url_pdf":"https://arxiv.org/pdf/2509.20819v1","authors":"[\"Andre Merzky\",\"Mikhail Titov\",\"Matteo Turilli\",\"Shantenu Jha\"]","published":"2025-09-25T07:01:51Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[]","has_code":false}
