{"ID":3084757,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T01:13:47.470342244Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05563","arxiv_id":"2606.05563","title":"SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations","abstract":"Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.","short_abstract":"Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We...","url_abs":"https://arxiv.org/abs/2606.05563","url_pdf":"https://arxiv.org/pdf/2606.05563v1","authors":"[\"Taewon Yun\",\"Hyeonseong Park\",\"Jeonghwan Choi\",\"Hayoon Park\",\"Yeeun Choi\",\"Hwanjun Song\"]","published":"2026-06-04T01:19:40Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
