{"ID":2860762,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.02695","arxiv_id":"2510.02695","title":"RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization","abstract":"In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) offers an attractive alternative but only if policies deliver high returns without incurring catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value or model-based pessimism, and restricted policy classes that limit policy expressiveness, whereas diffusion/flow-based expressive generative policies trained with a behavioral-cloning (BC) objective have been used only in risk-neutral settings. Here, we address this gap by introducing the \\textbf{Risk-Aware Multimodal Actor-Critic (RAMAC)}, which couples an expressive generative actor with a distributional critic and, to our knowledge, is the first model-free approach that learns \\emph{risk-aware expressive generative policies}. RAMAC differentiates a composite objective that adds a Conditional Value-at-Risk (CVaR) term to a BC loss, achieving risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further analyze OOD behavior under prior-anchored perturbation schemes from recent BC-regularized risk-averse offline RL. This clarifies why a behavior-regularized objective that directly constrains the expressive generative policy to the dataset support provides an effective, risk-agnostic mechanism for suppressing OOD actions in modern expressive policies. We instantiate RAMAC with a diffusion-based actor, using it both to illustrate the analysis in a 2-D risky bandit and to deploy OOD-action detectors on Stochastic-D4RL benchmarks, empirically validating our insights. Across these tasks, we observe consistent gains in $\\mathrm{CVaR}_{0.1}$ while maintaining strong returns. Our implementation is available at GitHub: https://github.com/KaiFukazawa/RAMAC.git","short_abstract":"In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) offers an attractive alternative but only if policies deliver high returns without incurring catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value or model-based p...","url_abs":"https://arxiv.org/abs/2510.02695","url_pdf":"https://arxiv.org/pdf/2510.02695v2","authors":"[\"Kai Fukazawa\",\"Kunal Mundada\",\"Iman Soltani\"]","published":"2025-10-03T03:22:21Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\"]","has_code":false,"code_links":[{"ID":608754,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2860762,"paper_url":"https://arxiv.org/abs/2510.02695","paper_title":"RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization","repo_url":"https://github.com/KaiFukazawa/RAMAC.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
