{"ID":2828997,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13415","arxiv_id":"2512.13415","title":"USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition","abstract":"Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail in capturing fine-grained hand and facial cues and modeling long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns using a combination of a Swin Transformer backbone enhanced with lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmarked datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM","short_abstract":"Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail in capturing fine-grained hand and f...","url_abs":"https://arxiv.org/abs/2512.13415","url_pdf":"https://arxiv.org/pdf/2512.13415v2","authors":"[\"Ahmed Abul Hasanaath\",\"Hamzah Luqman\"]","published":"2025-12-15T15:05:16Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\",\"Convolutional Neural Network\"]","has_code":false,"code_links":[{"ID":605920,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2828997,"paper_url":"https://arxiv.org/abs/2512.13415","paper_title":"USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition","repo_url":"https://github.com/gufranSabri/USTM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
