{"ID":2851039,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21860","arxiv_id":"2510.21860","title":"Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence","abstract":"We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench.","short_abstract":"We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vis...","url_abs":"https://arxiv.org/abs/2510.21860","url_pdf":"https://arxiv.org/pdf/2510.21860v1","authors":"[\"Callum Sharrock\",\"Lukas Petersson\",\"Hanna Petersson\",\"Axel Backlund\",\"Axel Wennström\",\"Kristoffer Nordström\",\"Elias Aronsson\"]","published":"2025-10-23T07:28:28Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}