{"ID":2867968,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.18420","arxiv_id":"2509.18420","title":"Instruction-Following Evaluation in Function Calling for Large Language Models","abstract":"Function calling is a core capability of large language models, essential for AI agents. Existing benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), tau^2-Bench (arXiv:2506.07982), and ACEBench (arXiv:2501.12851) evaluate argument correctness but do not test adherence to format instructions embedded in parameter descriptions, such as enclosing values in double quotes or using ISO date formats. We introduce IFEval-FC, a benchmark inspired by IFEval (arXiv:2311.07911) that assesses precise instruction following in function calling. IFEval-FC encodes verifiable formats directly within JSON schema descriptions, for example specifying that a value must not contain punctuation. It includes 750 test cases, each consisting of a function with an embedded format for one of its input parameters and a corresponding user query. Evaluation is fully algorithmic, ensuring objectivity, reproducibility, and scalability. Our results show that even state-of-the-art proprietary models, including GPT-5 and Claude 4.1 Opus, frequently fail to follow basic formatting rules, highlighting a practical limitation for real-world agent systems. The complete codebase and data are publicly available at https://github.com/Skripkon/IFEval-FC.","short_abstract":"Function calling is a core capability of large language models, essential for AI agents. Existing benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), tau^2-Bench (arXiv:2506.07982), and ACEBench (arXiv:2501.12851) evaluate argument correctness but do not test adherence to format instructions embedded i...","url_abs":"https://arxiv.org/abs/2509.18420","url_pdf":"https://arxiv.org/pdf/2509.18420v1","authors":"[\"Nikolai Skripko\"]","published":"2025-09-22T21:04:39Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":609528,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2867968,"paper_url":"https://arxiv.org/abs/2509.18420","paper_title":"Instruction-Following Evaluation in Function Calling for Large Language Models","repo_url":"https://github.com/Skripkon/IFEval-FC","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}