{"ID":3084701,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-06T20:54:36.964885582Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05464","arxiv_id":"2606.05464","title":"Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces","abstract":"Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives. We introduce OPT*, a scalable family of optimization-style tasks for training and evaluating LLM step-by-step optimization-like reasoning along a complexity axis: each task provides a feasibility checker and evaluator, while a complexity parameter expands the search space without requiring new human labels. This motivates studying these tasks in two regimes: (i) solver-guided online policy optimization, which uses a solver as a value oracle for partial states and applies rank-based reward shaping to reinforce better next steps, and (ii) search-based offline RL when such solvers are unavailable. Theoretically, we relate success in large search spaces to the information a reasoner extracts per unit of search budget. Empirically, we ablate the ingredients that make search efficient on OPT* and show that training on OPT* improves step-by-step optimization-like reasoning.","short_abstract":"Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives. We introduce OPT*, a scalable family of optimization-style tasks for training an...","url_abs":"https://arxiv.org/abs/2606.05464","url_pdf":"https://arxiv.org/pdf/2606.05464v1","authors":"[\"Nicolás Astorga\",\"Nabeel Seedat\",\"Mihaela van der Schaar\"]","published":"2026-06-03T21:43:38Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
