{"ID":2923502,"CreatedAt":"2026-06-02T04:05:25.881865328Z","UpdatedAt":"2026-06-04T17:36:40.748176825Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02547","arxiv_id":"2606.02547","title":"Pluralistic Leaderboards","abstract":"Recent leaderboard-based evaluations of large language models aggregate user feedback by fitting a Bradley--Terry model to pairwise comparisons, producing a single global ranking based on a latent quality score. While appealing for its simplicity, this approach is incompatible with heterogeneous preferences: when LLMs are used across diverse tasks and use cases, users who favor fundamentally different model behaviors can be systematically misrepresented when collapsed into a single quality score. To address this issue, we study \\emph{pluralistic leaderboards} that aim to remain \\emph{stable} with respect to heterogeneous user populations. Drawing on ideas from social choice theory, we adapt the notion of \\emph{local stability}, which requires that no model outside the top-$k$ positions is collectively preferred to the top-$k$ set by more than $O(1/k)$ fraction of users. Building on techniques from the social choice literature, we design an alternative leaderboard mechanism that satisfies local stability while eliciting only $\\widetilde{O}(k)$ pairwise comparisons per user, where $k$ is the size of the prefix for which stability is guaranteed. Using data from LMArena, we show that standard Bradley--Terry aggregation can violate local stability in practice, whereas our method provides substantially stronger stability guarantees.","short_abstract":"Recent leaderboard-based evaluations of large language models aggregate user feedback by fitting a Bradley--Terry model to pairwise comparisons, producing a single global ranking based on a latent quality score. While appealing for its simplicity, this approach is incompatible with heterogeneous preferences: when LLMs...","url_abs":"https://arxiv.org/abs/2606.02547","url_pdf":"https://arxiv.org/pdf/2606.02547v1","authors":"[\"Nika Haghtalab\",\"Ariel D. Procaccia\",\"Han Shao\",\"Serena Lutong Wang\",\"Kunhe Yang\"]","published":"2026-06-01T17:49:02Z","proceeding":"cs.GT","tasks":"[\"cs.GT\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
