{"ID":2858395,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08731","arxiv_id":"2510.08731","title":"When to Reason: Semantic Router for vLLM","abstract":"Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.","short_abstract":"Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many sim...","url_abs":"https://arxiv.org/abs/2510.08731","url_pdf":"https://arxiv.org/pdf/2510.08731v1","authors":"[\"Chen Wang\",\"Xunzhuo Liu\",\"Yuhan Liu\",\"Yue Zhu\",\"Xiangxi Mo\",\"Junchen Jiang\",\"Huamin Chen\"]","published":"2025-10-09T18:38:00Z","proceeding":"cs.ET","tasks":"[\"cs.ET\",\"cs.AI\",\"cs.CL\",\"eess.SY\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
