{"ID":2923573,"CreatedAt":"2026-06-02T04:05:25.881865328Z","UpdatedAt":"2026-06-04T13:12:39.622923895Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02404","arxiv_id":"2606.02404","title":"K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts","abstract":"Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00--45.67\\%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00--10.33\\%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00\\%, and we report this split separately as a targeted stress test. We publicly release our data and code.","short_abstract":"Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-p...","url_abs":"https://arxiv.org/abs/2606.02404","url_pdf":"https://arxiv.org/pdf/2606.02404v1","authors":"[\"Nahyun Lee\",\"Dongkeun Yoon\",\"Guijin Son\",\"Geewook Kim\",\"Dayoon Ko\",\"Jeonghun Park\",\"Haneul Yoo\",\"Jaewon Cho\",\"Junghun Park\",\"Changyoon Lee\",\"Kyochul Jang\",\"Jaeyeon Kim\",\"Eunsu Kim\",\"Woojin Cho\",\"Seungone Kim\"]","published":"2026-06-01T15:50:03Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
