{"ID":3049982,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T14:39:32.180964103Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04970","arxiv_id":"2606.04970","title":"Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance","abstract":"We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding \\textit{when} to interrupt, and \\textit{how} to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: \\textbf{(1)}~we release \\textbf{EgoProactive}, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; \\textbf{(2)}~we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into \\textbf{Pro\\textsuperscript{2}Bench} under a unified proactive-guidance schema; \\textbf{(3)}~we propose a \\textbf{decoupled planner--interaction architecture} specialized for procedural state, visual cues, and recovery injection; \\textbf{(4)}~we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama~4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus~4.6, Gemini~3.1~Pro, GPT~5.2) and open-weight baselines (Qwen3~VL~235B) across all six datasets. 
Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.","short_abstract":"We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding \\textit{when} to interrupt, and \\textit{how} to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions,...","url_abs":"https://arxiv.org/abs/2606.04970","url_pdf":"https://arxiv.org/pdf/2606.04970v1","authors":"[\"Kaustav Kundu\",\"Ritvik Shrivastava\",\"Maxim Arap\",\"Nanshu Wang\",\"Xianhui Zhu\",\"Quintin Fettes\",\"Gautam Tiwari\",\"Parth Suresh\",\"Théo Moutakanni\",\"Alejandro Castillejo Munoz\",\"Allen Bolourchi\",\"Pascale Fung\",\"Pinar Donmez\",\"Babak Damavandi\",\"Anuj Kumar\",\"Seungwhan Moon\"]","published":"2026-06-03T14:52:03Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[]","has_code":false}
