{"ID":2879964,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.15760","arxiv_id":"2508.15760","title":"LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries","abstract":"Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, provider-specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to discover and invoke tools dynamically. However, there is a significant gap in benchmarking multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 real-world queries that require coordinated use of multiple MCP tools. To address temporal variability in real-world tool responses, we introduce a parallel evaluation framework where a reference agent executes a validated plan simultaneously to produce real-time reference outputs. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting challenges in multi-step tool use. Comprehensive error analysis identifies seven failure modes spanning tool planning, parameterization, and output handling, pointing to concrete directions for improving current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous agent systems that reliably execute complex tasks through MCP tool orchestration.","short_abstract":"Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, provider-specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to discover and invoke tools dynamically. However, there is a significant gap in bench...","url_abs":"https://arxiv.org/abs/2508.15760","url_pdf":"https://arxiv.org/pdf/2508.15760v2","authors":"[\"Ming Yin\",\"Dinghan Shen\",\"Silei Xu\",\"Sixun Dong\",\"Mian Zhang\",\"Yebowen Hu\",\"Shujian Liu\",\"Jianbing Han\",\"Simin Ma\",\"Song Wang\",\"Sathish Reddy Indurthi\",\"Xun Wang\",\"Yiran Chen\",\"Kaiqiang Song\"]","published":"2025-08-21T17:55:54Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
