{"ID":2873148,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.07968","arxiv_id":"2509.07968","title":"SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge","abstract":"We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.","short_abstract":"We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created throu...","url_abs":"https://arxiv.org/abs/2509.07968","url_pdf":"https://arxiv.org/pdf/2509.07968v2","authors":"[\"Lukas Haas\",\"Gal Yona\",\"Giovanni D'Antonio\",\"Sasha Goldshtein\",\"Dipanjan Das\"]","published":"2025-09-09T17:53:58Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","project_urls":"[\"https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified\"]","has_code":false}
