{"ID":2880967,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.12631","arxiv_id":"2508.12631","title":"Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing","abstract":"Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models -- including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 -- Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Last but not least, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.","short_abstract":"Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that en...","url_abs":"https://arxiv.org/abs/2508.12631","url_pdf":"https://arxiv.org/pdf/2508.12631v2","authors":"[\"Yiqun Zhang\",\"Hao Li\",\"Jianhao Chen\",\"Hangfan Zhang\",\"Peng Ye\",\"Lei Bai\",\"Shuyue Hu\"]","published":"2025-08-18T05:23:31Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":610749,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2880967,"paper_url":"https://arxiv.org/abs/2508.12631","paper_title":"Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing","repo_url":"https://github.com/ZhangYiqun018/AvengersPro","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}