{"ID":2823395,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.00223","arxiv_id":"2601.00223","title":"JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation","abstract":"We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. In this context, the challenge is often \"which of these two good translations is better?\" rather than \"is this translation acceptable?\" This distinction matters for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness. JP-TL-Bench uses a protocol built to make LLM judging both reliable and affordable: it evaluates a candidate model via reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 \"LT\" score derived from a logistic transform of fitted log-strengths. Because each candidate is scored against the same frozen anchor set, scores are structurally stable given the same base set, judge, and aggregation code.","short_abstract":"We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. In this context, the challenge is often \"which of these two good translations is better?\" rather than \"is this translation acceptable?\" This distinction matters for Japanese-Engli...","url_abs":"https://arxiv.org/abs/2601.00223","url_pdf":"https://arxiv.org/pdf/2601.00223v1","authors":"[\"Leonard Lin\",\"Adam Lensenmayer\"]","published":"2026-01-01T06:09:45Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}