{"ID":3005070,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T07:50:16.0004273Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03318","arxiv_id":"2606.03318","title":"Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions","abstract":"Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions and lack experience-oriented evaluation. These limitations fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations on 19 widely adopted open-source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLMs achieve an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non-ideal user inputs. Our code and data are available at https://github.com/TorresYangX/RUT-Bench.","short_abstract":"Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions and lack experience-oriented evaluation. 
These limitations fail to account for the ambig...","url_abs":"https://arxiv.org/abs/2606.03318","url_pdf":"https://arxiv.org/pdf/2606.03318v1","authors":"[\"Xuan Yang\",\"Hao Xu\",\"Tingfeng Hui\",\"Hongsheng Xin\",\"Kaike Zhang\",\"Chunxiao Liu\",\"Ning Miao\"]","published":"2026-06-02T08:28:37Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":612730,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-03T03:09:48.883664427Z","DeletedAt":null,"paper_id":3005070,"paper_url":"https://arxiv.org/abs/2606.03318","paper_title":"Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions","repo_url":"https://github.com/TorresYangX/RUT-Bench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
