{"ID":2851982,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.18905","arxiv_id":"2510.18905","title":"3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency","abstract":"AI inference scaling is often tuned through 1D heuristics (a fixed reasoning pass) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling~$k$. Results show that knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy-maximization remains favorable when accuracy is prioritized. Our results further show that smaller models, when combined with optimal inference scaling, can match or exceed the performance of larger models at a fraction of the cost. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational conditions.","short_abstract":"AI inference scaling is often tuned through 1D heuristics (a fixed reasoning pass) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, e...","url_abs":"https://arxiv.org/abs/2510.18905","url_pdf":"https://arxiv.org/pdf/2510.18905v3","authors":"[\"Minseok Jung\",\"Abhas Ricky\",\"Muhammad Rameez Chatni\"]","published":"2025-10-21T01:03:46Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false}
