{"ID":2823251,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.00644","arxiv_id":"2601.00644","title":"FlexSpec: Frozen Drafts Meet Evolving Targets in Edge-Cloud Collaborative LLM Speculative Decoding","abstract":"Deploying large language models (LLMs) in mobile and edge computing environments is constrained by limited on-device resources, scarce wireless bandwidth, and frequent model evolution. Although edge-cloud collaborative inference with speculative decoding (SD) can reduce end-to-end latency by executing a lightweight draft model at the edge and verifying it with a cloud-side target model, existing frameworks fundamentally rely on tight coupling between the two models. Consequently, repeated model synchronization introduces excessive communication overhead, increasing end-to-end latency, and ultimately limiting the scalability of SD in edge environments. To address these limitations, we propose FlexSpec, a communication-efficient collaborative inference framework tailored for evolving edge-cloud systems. The core design of FlexSpec is a shared-backbone architecture that allows a single and static edge-side draft model to remain compatible with a large family of evolving cloud-side target models. By decoupling edge deployment from cloud-side model updates, FlexSpec eliminates the need for edge-side retraining or repeated model downloads, substantially reducing communication and maintenance costs. Furthermore, to accommodate time-varying wireless conditions and heterogeneous device constraints, we develop a channel-aware adaptive speculation mechanism that dynamically adjusts the speculative draft length based on real-time channel state information and device energy budgets. Extensive experiments demonstrate that FlexSpec achieves superior performance compared to conventional SD approaches in terms of inference efficiency.","short_abstract":"Deploying large language models (LLMs) in mobile and edge computing environments is constrained by limited on-device resources, scarce wireless bandwidth, and frequent model evolution. Although edge-cloud collaborative inference with speculative decoding (SD) can reduce end-to-end latency by executing a lightweight dra...","url_abs":"https://arxiv.org/abs/2601.00644","url_pdf":"https://arxiv.org/pdf/2601.00644v1","authors":"[\"Yuchen Li\",\"Rui Kong\",\"Zhonghao Lyu\",\"Qiyang Li\",\"Xinran Chen\",\"Hengyi Cai\",\"Lingyong Yan\",\"Shuaiqiang Wang\",\"Jiashu Zhao\",\"Guangxu Zhu\",\"Linghe Kong\",\"Guihai Chen\",\"Haoyi Xiong\",\"Dawei Yin\"]","published":"2026-01-02T11:09:39Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
