{"ID":2859609,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.06475","arxiv_id":"2510.06475","title":"PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles","abstract":"This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.","short_abstract":"This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. \nPuzzlePlex consists of 15 types of puzzles, including deterministic and...","url_abs":"https://arxiv.org/abs/2510.06475","url_pdf":"https://arxiv.org/pdf/2510.06475v1","authors":"[\"Yitao Long\",\"Yuru Jiang\",\"Hongjun Liu\",\"Yilun Zhao\",\"Jingchen Sun\",\"Yiqiu Shen\",\"Chen Zhao\",\"Arman Cohan\",\"Dennis Shasha\"]","published":"2025-10-07T21:24:29Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[]","has_code":false}
