{"ID":2857772,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.09872","arxiv_id":"2510.09872","title":"WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions","abstract":"Training web agents to navigate complex, real-world websites requires them to master $\\textit{subtasks}$ - short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open-source models on subtasks, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.","short_abstract":"Training web agents to navigate complex, real-world websites requires them to master $\\textit{subtasks}$ - short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel...","url_abs":"https://arxiv.org/abs/2510.09872","url_pdf":"https://arxiv.org/pdf/2510.09872v2","authors":"[\"Sanjari Srivastava\",\"Gang Li\",\"Cheng Chang\",\"Rishu Garg\",\"Manpreet Kaur\",\"Charlene Y. Lee\",\"Yuezhang Li\",\"Yining Mao\",\"Ignacio Cases\",\"Yanan Xie\",\"Peng Qi\"]","published":"2025-10-10T21:20:51Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
