{"ID":2921595,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:39:40.850681047Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01034","arxiv_id":"2606.01034","title":"A Finite-Calibration Regime Map for LLM Judge Panels","abstract":"We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset--budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not ``how many judges?'' but whether the next judge's information is estimable under the available human labels.","short_abstract":"We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns...","url_abs":"https://arxiv.org/abs/2606.01034","url_pdf":"https://arxiv.org/pdf/2606.01034v1","authors":"[\"Bin Zhu\",\"Yanghui Rao\"]","published":"2026-05-31T05:50:27Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"stat.ME\"]","methods":"[\"Large Language Model\"]","has_code":false}
