{"ID":2860406,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04354","arxiv_id":"2510.04354","title":"Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators","abstract":"Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are often evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned \\(π_0\\) on a joint distribution of objects and initial conditions, and find that our approach saves over \\(20-25\\%\\) of hardware evaluation effort to achieve similar bounds on policy performance.","short_abstract":"Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are often evaluated on a small num...","url_abs":"https://arxiv.org/abs/2510.04354","url_pdf":"https://arxiv.org/pdf/2510.04354v1","authors":"[\"Apurva Badithela\",\"David Snyder\",\"Lihan Zha\",\"Joseph Mikhail\",\"Matthew O'Kelly\",\"Anushri Dixit\",\"Anirudha Majumdar\"]","published":"2025-10-05T20:37:53Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\",\"eess.SY\"]","methods":"[\"Diffusion Model\"]","has_code":false}