{"ID":2827435,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.16273","arxiv_id":"2512.16273","title":"Fast Collaborative Inference via Distributed Speculative Decoding","abstract":"Speculative decoding accelerates large language model (LLM) inference by allowing a small draft model to predict multiple future tokens for verification by a larger target model. In AI-native radio access networks (AI-RAN), this enables device-edge collaborative inference but introduces significant uplink overhead, as existing distributed speculative decoding schemes transmit full vocabulary logits at every step. We propose a sparsify-then-sample strategy, Truncated Sparse Logits Transmission (TSLT), which transmits only the logits and indices of a truncated candidate set. We provide theoretical guarantees showing that the acceptance rate is preserved under TSLT. TSLT is further extended to multi-candidate case, where multiple draft candidates per step increase acceptance probability. Experiments show that TSLT significantly reduces uplink communication while maintaining end-to-end inference latency and model quality, demonstrating its effectiveness for scalable, communication-efficient distributed LLM inference in future AI-RAN systems.","short_abstract":"Speculative decoding accelerates large language model (LLM) inference by allowing a small draft model to predict multiple future tokens for verification by a larger target model. \nIn AI-native radio access networks (AI-RAN), this enables device-edge collaborative inference but introduces significant uplink overhead, as...","url_abs":"https://arxiv.org/abs/2512.16273","url_pdf":"https://arxiv.org/pdf/2512.16273v2","authors":"[\"Ce Zheng\",\"Ke Zhang\",\"Chen Sun\",\"Wenqi Zhang\",\"Qiong Liu\",\"Angesom Ataklity Tesfay\"]","published":"2025-12-18T07:49:52Z","proceeding":"eess.SP","tasks":"[\"eess.SP\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
