{"ID":2867449,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.19645","arxiv_id":"2509.19645","title":"Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling","abstract":"Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto-frontier, ignoring the simple fact that compute-optimal is not always system-optimal. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics, such as latency and cost-per-token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.","short_abstract":"Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto-frontier, ignoring the simple fact that compute-optimal is not always system-opti...","url_abs":"https://arxiv.org/abs/2509.19645","url_pdf":"https://arxiv.org/pdf/2509.19645v1","authors":"[\"Youpeng Zhao\",\"Jinpeng LV\",\"Di Wu\",\"Jun Wang\",\"Christopher Gooley\"]","published":"2025-09-23T23:52:07Z","proceeding":"cs.PF","tasks":"[\"cs.PF\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
