{"ID":2848962,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.24150","arxiv_id":"2510.24150","title":"Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean","abstract":"We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.","short_abstract":"We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistenc...","url_abs":"https://arxiv.org/abs/2510.24150","url_pdf":"https://arxiv.org/pdf/2510.24150v1","authors":"[\"Chanwoo Park\",\"Suyoung Park\",\"JiA Kang\",\"Jongyeon Park\",\"Sangho Kim\",\"Hyunji M. Park\",\"Sumin Bae\",\"Mingyu Kang\",\"Jaejin Lee\"]","published":"2025-10-28T07:42:59Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}