{"ID":2825062,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.22336","arxiv_id":"2512.22336","title":"Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback","abstract":"Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution. In this paper, we propose Agent2World, a tool-augmented multi-agent framework that achieves strong inference-time world-model generation and also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback. Agent2World follows a three-stage pipeline: (i) A Deep Researcher agent performs knowledge synthesis by web searching to address specification gaps; (ii) A Model Developer agent implements executable world models; And (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation. Agent2World demonstrates superior inference-time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state-of-the-art results. Beyond inference, Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. The model fine-tuned on these trajectories substantially improves world-model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: https://agent2world.github.io.","short_abstract":"Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors a...","url_abs":"https://arxiv.org/abs/2512.22336","url_pdf":"https://arxiv.org/pdf/2512.22336v1","authors":"[\"Mengkang Hu\",\"Bowei Xia\",\"Yuran Wu\",\"Ailing Yu\",\"Yude Zou\",\"Qiguang Chen\",\"Shijian Wang\",\"Jiarui Jin\",\"Kexin Li\",\"Wenxiang Jiao\",\"Yuan Lu\",\"Ping Luo\"]","published":"2025-12-26T18:54:14Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
