{"ID":2833121,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.05033","arxiv_id":"2512.05033","title":"Arbitrage: Efficient Reasoning via Advantage-Aware Speculation","abstract":"Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. 
Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\\sim2\\times$ at matched accuracy.","short_abstract":"Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast...","url_abs":"https://arxiv.org/abs/2512.05033","url_pdf":"https://arxiv.org/pdf/2512.05033v2","authors":"[\"Monishwaran Maheswaran\",\"Rishabh Tiwari\",\"Yuezhou Hu\",\"Kerem Dilmen\",\"Coleman Hooper\",\"Haocheng Xi\",\"Nicholas Lee\",\"Mehrdad Farajtabar\",\"Michael W. Mahoney\",\"Kurt Keutzer\",\"Amir Gholami\"]","published":"2025-12-04T17:50:53Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
