{"ID":3084743,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T00:57:50.230973856Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05544","arxiv_id":"2606.05544","title":"Probing Spatial Structure in Pretrained Audio Representations","abstract":"Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results reveal systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.","short_abstract":"Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrai...","url_abs":"https://arxiv.org/abs/2606.05544","url_pdf":"https://arxiv.org/pdf/2606.05544v1","authors":"[\"Chuyang Chen\",\"Sivan Ding\",\"Adrian S. Roman\",\"Juan Pablo Bello\"]","published":"2026-06-04T00:58:16Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}
