{"ID":2855198,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13512","arxiv_id":"2510.13512","title":"Offline and Online KL-Regularized RLHF under Differential Privacy","abstract":"In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective function in large language model alignment -- under the $ε$ local differential privacy ($ε$-LDP) model on the label of the human preference. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\\tilde{O}(1/[(e^ε-1)^2 n])$ on the KL-regularized objective under single-policy concentrability. We also prove its optimality by providing a matching lower bound where $n$ is the sample size. In the online setting, we are the first one to theoretically investigate the problem of KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\\mathcal{F}}\\log (N_{\\mathcal{F}}\\cdot T) /(e^ε-1)^2 )$, where $T$ is the total time step, $N_{\\mathcal{F}}$ is cardinality of the reward function space $\\mathcal{F}$ and $d_{\\mathcal{F}}$ is a variant of eluder dimension for RLHF. As a by-product of our analysis, our results also imply the first analysis for online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.","short_abstract":"In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective function in large language model alignment -- under the $ε$ local differential privacy ($ε$-LDP) model on the label of the human preference. \nIn the offline setti...","url_abs":"https://arxiv.org/abs/2510.13512","url_pdf":"https://arxiv.org/pdf/2510.13512v1","authors":"[\"Yulian Wu\",\"Rushil Thareja\",\"Praneeth Vepakomma\",\"Francesco Orabona\"]","published":"2025-10-15T13:04:19Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Language Model\",\"RLHF\"]","has_code":false,"code_links":[{"ID":608230,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2855198,"paper_url":"https://arxiv.org/abs/2510.13512","paper_title":"Offline and Online KL-Regularized RLHF under Differential Privacy","repo_url":"https://github.com/rushil-thareja/PPKL-RLHF-Official","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}