{"ID":2836425,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.21413","arxiv_id":"2511.21413","title":"Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM","abstract":"Due to rising demands for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose our solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer RAMSES. The initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring only an overhead of approximately 500 ms in terms of end-to-end latency.","short_abstract":"Due to rising demands for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating...","url_abs":"https://arxiv.org/abs/2511.21413","url_pdf":"https://arxiv.org/pdf/2511.21413v1","authors":"[\"Tim Trappen\",\"Robert Keßler\",\"Roland Pabel\",\"Viktor Achter\",\"Stefan Wesner\"]","published":"2025-11-26T14:06:22Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AI\",\"cs.DB\",\"cs.PF\"]","methods":"[\"Large Language Model\"]","has_code":false}