{"ID":2899521,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.00507","arxiv_id":"2507.00507","title":"Towards Resource-Efficient Serverless LLM Inference with SLINFER","abstract":"The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore modern platforms and find that: Emerging CPU architectures with built-in accelerators are capable of serving LLMs but remain underutilized, and both CPUs and GPUs can accommodate multiple LLMs simultaneously. We propose SLINFER, a resource-efficient serverless inference scheme tailored for small- to mid-sized LLMs that enables elastic and on-demand sharing across heterogeneous hardware. SLINFER tackles three fundamental challenges: (1) precise, fine-grained compute resource allocation at token-level to handle fluctuating computational demands; (2) a coordinated and forward-looking memory scaling mechanism to detect out-of-memory hazards and reduce operational overhead; and (3) a dual approach that consolidates fragmented instances through proactive preemption and reactive bin-packing. Experimental results on 4 32-core CPUs and 4 A100 GPUs show that SLINFER improves serving capacity by 47% - 62% through sharing, while further leveraging CPUs boosts this to 86% - 154%.","short_abstract":"The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore modern platforms and find that: Emerging CPU architectures with built-in accelerators...","url_abs":"https://arxiv.org/abs/2507.00507","url_pdf":"https://arxiv.org/pdf/2507.00507v2","authors":"[\"Chuhao Xu\",\"Zijun Li\",\"Quan Chen\",\"Han Zhao\",\"Xueyan Tang\",\"Minyi Guo\"]","published":"2025-07-01T07:22:39Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Large Language Model\"]","has_code":false}
