{"ID":2824511,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23858","arxiv_id":"2512.23858","title":"Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding","abstract":"Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\\times$ speedup over state-of-the-art baselines across multiple hardware setups.","short_abstract":"Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative d...","url_abs":"https://arxiv.org/abs/2512.23858","url_pdf":"https://arxiv.org/pdf/2512.23858v1","authors":"[\"Yue Guan\",\"Changming Yu\",\"Shihan Fang\",\"Weiming Hu\",\"Zaifeng Pan\",\"Zheng Wang\",\"Zihan Liu\",\"Yangjie Zhou\",\"Yufei Ding\",\"Minyi Guo\",\"Jingwen Leng\"]","published":"2025-12-29T20:51:38Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.PL\"]","methods":"[\"Large Language Model\"]","has_code":false}
