{"ID":2835114,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.00351","arxiv_id":"2512.00351","title":"Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning","abstract":"The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample complexity on the long horizon and the large state space, the high computational complexity, non-Markov policy, non-Nash policy, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm \\emph{Memory-Efficient Nash Q-Learning (ME-Nash-QL)} for two-player zero-sum Markov games, which is a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity $\\widetilde{O}(H^4SAB/\\varepsilon^2)$, where $S$ is the number of states, $\\{A, B\\}$ is the number of actions for two players, and $H$ is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when $\\min\\{A, B\\}\\ll H^2$. Second, ME-Nash-QL achieves the lowest computational complexity $O(T\\mathrm{poly}(AB))$ while preserving Markov policies, where $T$ is the number of samples. Third, ME-Nash-QL also achieves the best burn-in cost $O(SAB\\,\\mathrm{poly}(H))$, whereas previous algorithms have a burn-in cost of at least $O(S^3 AB\\,\\mathrm{poly}(H))$ to attain the same level of sample complexity with ours.","short_abstract":"The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample comple...","url_abs":"https://arxiv.org/abs/2512.00351","url_pdf":"https://arxiv.org/pdf/2512.00351v1","authors":"[\"Na Li\",\"Yuchen Jiao\",\"Hangguan Shan\",\"Shefeng Yan\"]","published":"2025-11-29T06:44:25Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"stat.ML\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}