{"ID":3050095,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T11:27:32.998563389Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04749","arxiv_id":"2606.04749","title":"COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection","abstract":"Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. This objective-wise treatment neglects inter-objective correlation and can lead to overly conservative value estimates, thereby reducing sample efficiency. To address this issue, we propose Cholesky-Ordered Projection Q-learning (COP-Q), a safety-first method that incorporates inter-objective covariance into vector-valued Q-value estimation. COP-Q constructs a generalized confidence bound in the joint Q-value space and uses Cholesky factorization to encode objective priority in a sequential form. This preserves conservatism on safety while adaptively reducing excessive conservatism on the reward objective. The resulting estimate is used in both temporal-difference target computation and actor optimization. COP-Q incurs minimal computational overhead and is readily compatible with most existing deep Q-learning frameworks. Experiments on robot locomotion in Brax and safe navigation in Safety-Gymnasium, covering both hard- and soft-safety settings, demonstrate that COP-Q achieves strong safety performance together with competitive or improved sample efficiency relative to representative baselines.","short_abstract":"Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. \nThis objective-wise treatment neglects inter-objective co...","url_abs":"https://arxiv.org/abs/2606.04749","url_pdf":"https://arxiv.org/pdf/2606.04749v1","authors":"[\"Guopeng Li\",\"Moritz A. Zanger\",\"Matthijs T. J. Spaan\",\"Julian F. P. Kooij\"]","published":"2026-06-03T11:30:10Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
