{"ID":2853837,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15231","arxiv_id":"2510.15231","title":"Extending Audio Context for Long-Form Understanding in Large Audio-Language Models","abstract":"Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, modality-decoupled extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio confirm that Partial YaRN outperforms the original models across a wide range of settings, and VLAT provides substantial performance improvement on long audio of unseen lengths.","short_abstract":"Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. \nFir...","url_abs":"https://arxiv.org/abs/2510.15231","url_pdf":"https://arxiv.org/pdf/2510.15231v2","authors":"[\"Yuatyong Chaichana\",\"Pittawat Taveekitworachai\",\"Warit Sirichotedumrong\",\"Potsawee Manakul\",\"Kunat Pipatanakul\"]","published":"2025-10-17T01:44:28Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
