{"ID":3004644,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T11:43:53.432517148Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03954","arxiv_id":"2606.03954","title":"VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring","abstract":"As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal-conditioned safety annotations is introduced, enabling a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent-action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame compared to baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.","short_abstract":"As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. \nWe introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and...","url_abs":"https://arxiv.org/abs/2606.03954","url_pdf":"https://arxiv.org/pdf/2606.03954v1","authors":"[\"Hanjiang Hu\",\"Yiyuan Pan\",\"Jiaxing Li\",\"Xusheng Luo\",\"Alexander Robey\",\"Na Li\",\"Yebin Wang\",\"Changliu Liu\"]","published":"2026-06-02T17:42:17Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\",\"cs.RO\"]","methods":"[]","has_code":false,"code_links":[{"ID":612689,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-03T03:09:48.883664427Z","DeletedAt":null,"paper_id":3004644,"paper_url":"https://arxiv.org/abs/2606.03954","paper_title":"VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring","repo_url":"https://github.com/HanjiangHu/VLESA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}