{"ID":2874344,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.05440","arxiv_id":"2509.05440","title":"Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too","abstract":"As large-language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For \\textit{sample-level} performance, methods which operate by using pairwise comparisons between machine-generated text perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (\\textbf{+0.03}), TopicalChat (\\textbf{-0.03}), and HANNA (\\textbf{+0.05}) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.","short_abstract":"As large-language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For \\textit{sample-level} performance, methods...","url_abs":"https://arxiv.org/abs/2509.05440","url_pdf":"https://arxiv.org/pdf/2509.05440v1","authors":"[\"Logan Lawrence\",\"Ashton Williamson\",\"Alexander Shelton\"]","published":"2025-09-05T18:48:34Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
