{"ID":2850392,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.22443","arxiv_id":"2510.22443","title":"Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents","abstract":"There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user's goal/query (e.g. \"Where did I leave my keys?\"). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this \"goal inference\" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.","short_abstract":"There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user's goal/query (e.g. \"Where did I leave my keys?\"). In this work, we consider the important complementary problem of inferring that goal from multi-modal...","url_abs":"https://arxiv.org/abs/2510.22443","url_pdf":"https://arxiv.org/pdf/2510.22443v1","authors":"[\"Vijay Veerabadran\",\"Fanyi Xiao\",\"Nitin Kamra\",\"Pedro Matias\",\"Joy Chen\",\"Caley Drooff\",\"Brett D Roads\",\"Riley Williams\",\"Ethan Henderson\",\"Xuanyi Zhao\",\"Kevin Carlberg\",\"Joseph Tighe\",\"Karl Ridgeway\"]","published":"2025-10-25T21:54:01Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
