{"ID":2831703,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.07558","arxiv_id":"2512.07558","title":"ReLaX: Reasoning with Latent Exploration for Large Reasoning Models","abstract":"Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often drives the policy toward over-determinism, resulting in ineffective exploration and premature policy convergence. While promoting token-level diversity has shown promise in mitigating entropy collapse, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model's latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a framework that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX consistently incentivizes reasoning capability and outperforms existing token-level methods. Our project is available at https://github.com/ZhangShimin1/ReLaX.","short_abstract":"Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often drives the policy toward over-determinism, resulting in ineffective exploration and premature policy convergence. While promoti...","url_abs":"https://arxiv.org/abs/2512.07558","url_pdf":"https://arxiv.org/pdf/2512.07558v2","authors":"[\"Shimin Zhang\",\"Xianwei Chen\",\"Yufan Shen\",\"Ziyuan Ye\",\"Jibin Wu\"]","published":"2025-12-08T13:48:33Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"LoRA\"]","has_code":false,"code_links":[{"ID":606155,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2831703,"paper_url":"https://arxiv.org/abs/2512.07558","paper_title":"ReLaX: Reasoning with Latent Exploration for Large Reasoning Models","repo_url":"https://github.com/ZhangShimin1/ReLaX","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
