{"ID":3049921,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T15:44:26.945507316Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05104","arxiv_id":"2606.05104","title":"Knowledge Index of Noah's Ark","abstract":"Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B \u003e Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.","short_abstract":"Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained discipli...","url_abs":"https://arxiv.org/abs/2606.05104","url_pdf":"https://arxiv.org/pdf/2606.05104v1","authors":"[\"Sheng Jin\",\"Minghao Liu\",\"Yunze Xiao\",\"Zeqi Zhou\",\"Heli Qi\",\"Yifan Yao\",\"Meishu Song\",\"Kaijing Ma\",\"Xuan Zhang\",\"Sicong Jiang\",\"Yizhe Li\",\"Ningshan Ma\",\"Jie Wei\",\"Ziniu Li\",\"Minglai Yang\",\"Bangya Liu\",\"Yiming Liang\",\"Xiao Fang\",\"Qingcheng Zeng\",\"Jiarui Liu\",\"Rui Yang\",\"Shen Yan\",\"Wenhao Huang\",\"Jiaheng Liu\",\"Zihan Wang\",\"Weihao Xuan\",\"Ge Zhang\"]","published":"2026-06-03T17:06:49Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
