{"ID":2875022,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.03292","arxiv_id":"2509.03292","title":"Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings","abstract":"We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores--Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness--for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A main challenge is the domain shift between natural training data and synthetic evaluation data. To address this, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor and use a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.","short_abstract":"We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores--Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness--for audio generated by text-to-s...","url_abs":"https://arxiv.org/abs/2509.03292","url_pdf":"https://arxiv.org/pdf/2509.03292v1","authors":"[\"Dyah A. M. G. Wisnu\",\"Ryandhimas E. Zezario\",\"Stefano Rini\",\"Hsin-Min Wang\",\"Yu Tsao\"]","published":"2025-09-03T13:19:56Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.LG\",\"cs.SD\"]","methods":"[\"Transformer\"]","has_code":false}
