{"ID":2857421,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.09127","arxiv_id":"2510.09127","title":"Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback","abstract":"We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $Π$ we establish an optimal expected regret bound of $ O (\\sqrt{KT \\log |Π|} + \\sqrt{D \\log |Π|)} $ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $ \\mathcal{F} $ with access to an online least-square regression oracle $\\mathcal{O}$ over $\\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\\sqrt{KT\\mathcal{R}_T(\\mathcal{O})} + \\sqrt{ d_{\\max} D β})$ assuming FIFO order, where $d_{\\max}$ is the maximal delay, $\\mathcal{R}_T(\\mathcal{O})$ is an upper bound on the oracle's regret and $β$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\\mathcal{F}$ and show that its stability parameter $β$ is bounded by $\\log |\\mathcal{F}|$, resulting in an expected regret bound of $O(\\sqrt{KT \\log |\\mathcal{F}|} + \\sqrt{d_{\\max} D \\log |\\mathcal{F}|})$ which is a $\\sqrt{d_{\\max}}$ factor away from the lower bound of $Ω(\\sqrt{KT \\log |\\mathcal{F}|} + \\sqrt{D \\log |\\mathcal{F}|})$ that we also present.","short_abstract":"We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $Π$ we establish an...","url_abs":"https://arxiv.org/abs/2510.09127","url_pdf":"https://arxiv.org/pdf/2510.09127v1","authors":"[\"Orin Levy\",\"Liad Erez\",\"Alon Cohen\",\"Yishay Mansour\"]","published":"2025-10-10T08:28:23Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false}
