{"ID":2875679,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.01052","arxiv_id":"2509.01052","title":"FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games","abstract":"GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.","short_abstract":"GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity...","url_abs":"https://arxiv.org/abs/2509.01052","url_pdf":"https://arxiv.org/pdf/2509.01052v2","authors":"[\"Jaewoo Ahn\",\"Junseo Kim\",\"Heeseung Yun\",\"Jaehyeon Son\",\"Dongmin Park\",\"Jaewoong Cho\",\"Gunhee Kim\"]","published":"2025-09-01T01:33:16Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false}