{"ID":2892200,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.15501","arxiv_id":"2507.15501","title":"ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution","abstract":"This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. These assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. To achieve this, we develop ASPERA, a framework comprising an assistant library simulation and a human-assisted LLM data generation engine. Our engine allows developers to guide LLM generation of high-quality tasks consisting of complex user queries, simulation state and corresponding validation programs, tackling data availability and evaluation robustness challenges. Alongside the framework we release Asper-Bench, an evaluation dataset of 250 challenging tasks generated using ASPERA, which we use to show that program generation grounded in custom assistant libraries is a significant challenge to LLMs compared to dependency-free code generation.","short_abstract":"This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. These assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. \nTo a...","url_abs":"https://arxiv.org/abs/2507.15501","url_pdf":"https://arxiv.org/pdf/2507.15501v1","authors":"[\"Alexandru Coca\",\"Mark Gaynor\",\"Zhenxing Zhang\",\"Jianpeng Cheng\",\"Bo-Hsiang Tseng\",\"Pete Boothroyd\",\"Héctor Martinez Alonso\",\"Diarmuid Ó Séaghdha\",\"Anders Johannsen\"]","published":"2025-07-21T11:07:05Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
