{"ID":2834359,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.01357","arxiv_id":"2512.01357","title":"Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity","abstract":"Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through efficient GPU memory reuse. By leveraging the unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. These techniques collectively address the critical challenges of inefficient memory usage and the cold-start problem in Serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2 times faster loading and reduces Time-To-First-Token (TTFT) during cold-start by 23--55% over state-of-the-art methods.","short_abstract":"Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with...","url_abs":"https://arxiv.org/abs/2512.01357","url_pdf":"https://arxiv.org/pdf/2512.01357v1","authors":"[\"Wenbin Zhu\",\"Zhaoyan Shen\",\"Zili Shao\",\"Hongjun Dai\",\"Feng Chen\"]","published":"2025-12-01T07:10:34Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AI\",\"cs.AR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
