{"ID":2862296,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.01389","arxiv_id":"2510.01389","title":"INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models","abstract":"Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present \\textbf{INSIGHT}, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using $π_0$-FAST as the underlying model, we extract per-token \\emph{entropy}, \\emph{log-probability}, and Dirichlet-based estimates of \\emph{aleatoric and epistemic uncertainty}, and train compact transformer classifiers to map these sequences to help triggers. We explore supervision regimes for strong or weak supervision, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.","short_abstract":"Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. 
We present \\textbf{INSIGHT}, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should...","url_abs":"https://arxiv.org/abs/2510.01389","url_pdf":"https://arxiv.org/pdf/2510.01389v2","authors":"[\"Ulas Berk Karli\",\"Ziyao Shangguan\",\"Tesca Fitzgerald\"]","published":"2025-10-01T19:22:48Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false}