{"ID":2850549,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21275","arxiv_id":"2510.21275","title":"Investigating Scale Independent UCT Exploration Factor Strategies","abstract":"The Upper Confidence Bounds For Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with the sparse rewards of $\\{-1,0,1\\}$ at the end of the game, this is not a problem, but many games often feature dense rewards with hand-picked reward scales, causing a node's Q-value to span different magnitudes across different games. In this paper, we evaluate various strategies for adaptively choosing the UCT exploration constant $λ$, called $λ$-strategies, that are agnostic to the game's reward scale. These $λ$-strategies include those proposed in the literature as well as five new strategies. Given our experimental results, we recommend using one of our newly suggested $λ$-strategies, which is to choose $λ$ as $2 \\cdot σ$ where $σ$ is the empirical standard deviation of all state-action pairs' Q-values of the search tree. This method outperforms existing $λ$-strategies across a wide range of tasks both in terms of a single parameter value and the peak performances obtained by optimizing all available parameters.","short_abstract":"The Upper Confidence Bounds For Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with the sparse rewards of $\\{-1,0,1\\}$ at the end of the game, this is not a problem, but many games often feature dense rewards with hand-picked reward scales, causing a node's Q-...","url_abs":"https://arxiv.org/abs/2510.21275","url_pdf":"https://arxiv.org/pdf/2510.21275v1","authors":"[\"Robin Schmöcker\",\"Christoph Schnell\",\"Alexander Dockhorn\"]","published":"2025-10-24T09:19:14Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"LoRA\"]","has_code":false}
