{"ID":3006134,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-04T19:14:31.964469513Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02962","arxiv_id":"2606.02962","title":"Hand Trajectory Fusion for Egocentric Natural Language Query Grounding","abstract":"Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or its immediate outcome. We propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.","short_abstract":"Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a mome...","url_abs":"https://arxiv.org/abs/2606.02962","url_pdf":"https://arxiv.org/pdf/2606.02962v1","authors":"[\"Enmin Zhong\",\"Carlos R. del-Blanco\",\"Fernando Jaureguizar\",\"Narciso García\"]","published":"2026-06-01T23:46:18Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.HC\",\"eess.IV\"]","methods":"[]","has_code":false}
