{"ID":2873241,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.06274","arxiv_id":"2509.06274","title":"IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs","abstract":"Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR, a quality-constrained Intelligent Prompt Routing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter $\\tau \\in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench (to be released upon legal approval), a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency. The deployed system and additional product details are publicly available at https://aws.amazon.com/bedrock/intelligent-prompt-routing/","short_abstract":"Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR, a quality-constrained Intelligent Prompt Routing framework that dynamica...","url_abs":"https://arxiv.org/abs/2509.06274","url_pdf":"https://arxiv.org/pdf/2509.06274v4","authors":"[\"Aosong Feng\",\"Balasubramaniam Srinivasan\",\"Yun Zhou\",\"Zhichao Xu\",\"Kang Zhou\",\"Sheng Guan\",\"Yueyan Chen\",\"Xian Wu\",\"Ninad Kulkarni\",\"Yi Zhang\",\"Zhengyuan Shen\",\"Dmitriy Bespalov\",\"Soumya Smruti Mishra\",\"Yifei Teng\",\"Darren Yow-Bang Wang\",\"Haibo Ding\",\"Lin Lee Cheong\"]","published":"2025-09-08T01:46:27Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\"]","project_urls":"[\"https://aws.amazon.com/bedrock/intelligent-prompt-routing/\"]","has_code":false}