{"ID":2872658,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.08689","arxiv_id":"2509.08689","title":"Augmenting speech transcripts of VR recordings with gaze, pointing, and visual context for multimodal coreference resolution","abstract":"Understanding transcripts of immersive multimodal conversations is challenging because speakers frequently rely on visual context and non-verbal cues, such as gestures and visual attention, which are not captured in speech alone. This lack of information makes coreference resolution (the task of linking ambiguous expressions like ``it'' or ``there'' to their intended referents) particularly challenging. In this paper, we present a system that augments VR speech transcripts with eye-tracking data, laser-pointing data, and scene metadata to generate textual descriptions of non-verbal communication and the corresponding objects of interest. To evaluate the system, we collected gaze, gesture, and voice data from 12 participants (6 pairs) engaged in an open-ended design critique of a 3D model of an apartment. Our results show a 26.5\\% improvement in coreference resolution accuracy by a GPT model when using our multimodal transcript compared to a speech-only baseline.","short_abstract":"Understanding transcripts of immersive multimodal conversations is challenging because speakers frequently rely on visual context and non-verbal cues, such as gestures and visual attention, which are not captured in speech alone. This lack of information makes coreference resolution (the task of linking ambiguous expre...","url_abs":"https://arxiv.org/abs/2509.08689","url_pdf":"https://arxiv.org/pdf/2509.08689v1","authors":"[\"Riccardo Bovo\",\"Frederik Brudy\",\"George Fitzmaurice\",\"Fraser Anderson\"]","published":"2025-09-10T15:27:17Z","proceeding":"cs.HC","tasks":"[\"cs.HC\"]","methods":"[]","has_code":false}
