{"ID":2921837,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:56:00.181519634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01400","arxiv_id":"2606.01400","title":"Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs","abstract":"Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \\geq 0.90$ in 99.2\\% of stochastic configurations (mean $W = 0.997 \\pm 0.008$), while at higher percentile thresholds selected subsets achieve 25--48\\% prompt reduction on average. Ranking divergence from the full benchmark ($\\rho \u003c 0.95$) occurs in only 15.95\\% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.","short_abstract":"Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and appli...","url_abs":"https://arxiv.org/abs/2606.01400","url_pdf":"https://arxiv.org/pdf/2606.01400v1","authors":"[\"Denica Kjorvezir\",\"Marko Djukanović\",\"Ana Gjorgjevikj\",\"Gjorgjina Cenikj\",\"Tome Eftimov\"]","published":"2026-05-31T18:45:12Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}