{"ID":2860411,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04363","arxiv_id":"2510.04363","title":"MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models","abstract":"We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium. MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at https://github.com/hyunjun1121/MacroBench to enable reproducible assessment of macro synthesis for web automation.","short_abstract":"We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium. MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting d...","url_abs":"https://arxiv.org/abs/2510.04363","url_pdf":"https://arxiv.org/pdf/2510.04363v2","authors":"[\"Hyunjun Kim\",\"Sejong Kim\"]","published":"2025-10-05T21:15:11Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608727,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2860411,"paper_url":"https://arxiv.org/abs/2510.04363","paper_title":"MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models","repo_url":"https://github.com/hyunjun1121/MacroBench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
