{"ID":2842210,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.10190","arxiv_id":"2511.10190","title":"Towards Leveraging Sequential Structure in Animal Vocalizations","abstract":"Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and gumbel-softmax vector quantization of extracted self-supervised speech model representations can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using $k$-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performances, and hold promise as alternative feature representations towards leveraging sequential information in animal vocalizations.","short_abstract":"Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoust...","url_abs":"https://arxiv.org/abs/2511.10190","url_pdf":"https://arxiv.org/pdf/2511.10190v1","authors":"[\"Eklavya Sarkar\",\"Mathew Magimai. -Doss\"]","published":"2025-11-13T11:00:38Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false}