{"ID":2921077,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-04T07:41:34.29888543Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01869","arxiv_id":"2606.01869","title":"WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis","abstract":"Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with Three.js, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and keep user-facing controls synchronized with hidden runtime state. Existing web-generation benchmarks and evaluators, however, largely observe only pixels or DOM nodes, while the mechanics of a Three.js world unfold inside an opaque \u003ccanvas\u003e. We introduce WorldCoder-Bench, a benchmark for autonomous, physically grounded 3D world synthesis. WorldCoder-Bench contains 2,026 expert-curated tasks across Simulation, Rendering, and Application scenarios, with optional .glb assets and hidden behavioral contracts. We further propose StateProbe, an execution-based protocol that probes generated programs in a sandboxed browser and verifies hidden, mutation-hardened contracts over runtime states and transitions. Beyond verification coverage, we report Return on Automation and Time Efficiency Multiplier to measure correctness-adjusted cost and time savings. Across nine frontier models, the best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures dominated by state-schema drift and broken interaction chains rather than missing scene elements. Utility metrics further show that cheap or fast models can still provide substantial value on easier domains. WorldCoder-Bench is available at https://anonymous.4open.science/r/WorldCoder-Bench/.","short_abstract":"Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with Three.js, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and...","url_abs":"https://arxiv.org/abs/2606.01869","url_pdf":"https://arxiv.org/pdf/2606.01869v1","authors":"[\"Shuo Lu\",\"Yinuo Xu\",\"Kecheng Yu\",\"Siru Jiang\",\"Yongcan Yu\",\"Yubin Wang\",\"Haitao Yang\",\"Yuxiang Zhang\",\"Bin Wang\",\"Ran He\",\"Jian Liang\"]","published":"2026-06-01T08:17:49Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","project_urls":"[\"https://anonymous.4open.science/r/WorldCoder-Bench/\"]","has_code":false}
