{"ID":2851610,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.19368","arxiv_id":"2510.19368","title":"AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch","abstract":"Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 \u0026 V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. Thus, AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.","short_abstract":"Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-sc...","url_abs":"https://arxiv.org/abs/2510.19368","url_pdf":"https://arxiv.org/pdf/2510.19368v2","authors":"[\"Weichuang Shao\",\"Iman Yi Liao\",\"Tomas Henrique Bode Maul\",\"Tissa Chandesa\"]","published":"2025-10-22T08:41:59Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.LG\"]","methods":"[\"Transformer\",\"Convolutional Neural Network\"]","has_code":false}
