{"ID":2897553,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.05043","arxiv_id":"2507.05043","title":"MoLink: Distributed and Efficient Serving Framework for Large Models","abstract":"Large language models represent a groundbreaking shift in generative AI. Yet, these advances come with a significant challenge: the high cost of model serving. To mitigate these costs, consumer-grade GPUs emerge as a more affordable alternative. This presents an opportunity for more cost-efficient LLM serving by leveraging these GPUs. However, it is non-trivial to achieve high-efficiency LLM serving on consumer-grade GPUs, mainly due to two challenges: 1) these GPUs are often deployed in limited network conditions; 2) these GPUs often exhibit heterogeneity in host systems. To address these challenges, we present MoLink, a distributed LLM serving system for large models. It incorporates several key techniques, enabling efficient LLM serving on heterogeneous and weakly connected consumer-grade GPUs. Our experiments demonstrate that it achieves throughput improvements of up to 458% and cost-profit margin improvements of up to 151%, compared to state-of-the-art systems. MoLink allows users on Windows, Linux, and containerized VMs to seamlessly integrate GPUs with just a few lines of code over Ethernet or public networks. Currently, it supports 18 mainstream architectures of open-source large language models. The source code is publicly available at https://github.com/oldcpple/MoLink.","short_abstract":"Large language models represent a groundbreaking shift in generative AI. Yet, these advances come with a significant challenge: the high cost of model serving. To mitigate these costs, consumer-grade GPUs emerge as a more affordable alternative. This presents an opportunity for more cost-efficient LLM serving by levera...","url_abs":"https://arxiv.org/abs/2507.05043","url_pdf":"https://arxiv.org/pdf/2507.05043v2","authors":"[\"Lewei Jin\",\"Yongqi Chen\",\"Kui Zhang\",\"Yifan Zhuo\",\"Yi Gao\",\"Bowei Yang\",\"Zhengong Cai\",\"Wei Dong\"]","published":"2025-07-07T14:27:56Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":612347,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2897553,"paper_url":"https://arxiv.org/abs/2507.05043","paper_title":"MoLink: Distributed and Efficient Serving Framework for Large Models","repo_url":"https://github.com/oldcpple/MoLink","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
