{"ID":2864418,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24002","arxiv_id":"2509.24002","title":"MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use","abstract":"MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of $127$ high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only $52.56$\\% pass@1 and $33.86$\\% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below $30$\\% pass@1 and $15$\\% pass^4. On average, LLMs require $16.2$ execution turns and $17.4$ tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.","short_abstract":"MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this ga...","url_abs":"https://arxiv.org/abs/2509.24002","url_pdf":"https://arxiv.org/pdf/2509.24002v1","authors":"[\"Zijian Wu\",\"Xiangyan Liu\",\"Xinyuan Zhang\",\"Lingjun Chen\",\"Fanqing Meng\",\"Lingxiao Du\",\"Yiran Zhao\",\"Fanshi Zhang\",\"Yaoqi Ye\",\"Jiawei Wang\",\"Zirui Wang\",\"Jinjie Ni\",\"Yufan Yang\",\"Arvin Xu\",\"Michael Qizhe Shieh\"]","published":"2025-09-28T17:53:27Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
