{"ID":2833057,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.04921","arxiv_id":"2512.04921","title":"The AI Consumer Index (ACE)","abstract":"We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50\\%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.","short_abstract":"We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a de...","url_abs":"https://arxiv.org/abs/2512.04921","url_pdf":"https://arxiv.org/pdf/2512.04921v3","authors":"[\"Julien Benchek\",\"Rohit Shetty\",\"Benjamin Hunsberger\",\"Ajay Arun\",\"Zach Richards\",\"Brendan Foody\",\"Osvald Nitski\",\"Bertie Vidgen\"]","published":"2025-12-04T15:54:28Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.HC\"]","methods":"[]","has_code":false}
