{"ID":2897908,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.04406","arxiv_id":"2507.04406","title":"Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning","abstract":"We study reinforcement learning (RL) in the agnostic policy learning setting, where the goal is to find a policy whose performance is competitive with the best policy in a given class of interest $Π$ -- crucially, without assuming that $Π$ contains the optimal policy. We propose a general policy learning framework that reduces this problem to first-order optimization in a non-Euclidean space, leading to new algorithms as well as shedding light on the convergence properties of existing ones. Specifically, under the assumption that $Π$ is convex and satisfies a variational gradient dominance (VGD) condition -- an assumption known to be strictly weaker than more standard completeness and coverability conditions -- we obtain sample complexity upper bounds for three policy learning algorithms: \\emph{(i)} Steepest Descent Policy Optimization, derived from a constrained steepest descent method for non-convex optimization; \\emph{(ii)} the classical Conservative Policy Iteration algorithm \\citep{kakade2002approximately} reinterpreted through the lens of the Frank-Wolfe method, which leads to improved convergence results; and \\emph{(iii)} an on-policy instantiation of the well-studied Policy Mirror Descent algorithm. Finally, we empirically evaluate the VGD condition across several standard environments, demonstrating the practical relevance of our key assumption.","short_abstract":"We study reinforcement learning (RL) in the agnostic policy learning setting, where the goal is to find a policy whose performance is competitive with the best policy in a given class of interest $Π$ -- crucially, without assuming that $Π$ contains the optimal policy. We propose a general policy learning framework that...","url_abs":"https://arxiv.org/abs/2507.04406","url_pdf":"https://arxiv.org/pdf/2507.04406v1","authors":"[\"Uri Sherman\",\"Tomer Koren\",\"Yishay Mansour\"]","published":"2025-07-06T14:40:05Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"math.OC\",\"stat.ML\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
