{"ID":2828633,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14652","arxiv_id":"2512.14652","title":"Segmental Attention Decoding With Long Form Acoustic Encodings","abstract":"We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses ability to order acoustic encodings due to permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.","short_abstract":"We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-fo...","url_abs":"https://arxiv.org/abs/2512.14652","url_pdf":"https://arxiv.org/pdf/2512.14652v1","authors":"[\"Pawel Swietojanski\",\"Xinwei Li\",\"Mingbin Xu\",\"Takaaki Hori\",\"Dogan Can\",\"Xiaodan Zhuang\"]","published":"2025-12-16T18:12:37Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CL\"]","methods":"[]","has_code":false}
