{"ID":2828757,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14754","arxiv_id":"2512.14754","title":"Revisiting the Reliability of Language Models in Instruction-Following","abstract":"Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications. What's more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.","short_abstract":"Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-ori...","url_abs":"https://arxiv.org/abs/2512.14754","url_pdf":"https://arxiv.org/pdf/2512.14754v3","authors":"[\"Jianshuo Dong\",\"Yutong Zhang\",\"Yan Liu\",\"Zhenyu Zhong\",\"Tao Wei\",\"Chao Zhang\",\"Han Qiu\"]","published":"2025-12-15T02:57:55Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":605900,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2828757,"paper_url":"https://arxiv.org/abs/2512.14754","paper_title":"Revisiting the Reliability of Language Models in Instruction-Following","repo_url":"https://github.com/jianshuod/IFEval-pp","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
