{"ID":2898158,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.04094","arxiv_id":"2507.04094","title":"MMMOS: Multi-domain Multi-axis Audio Quality Assessment","abstract":"Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's τ versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.","short_abstract":"Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS,...","url_abs":"https://arxiv.org/abs/2507.04094","url_pdf":"https://arxiv.org/pdf/2507.04094v2","authors":"[\"Yi-Cheng Lin\",\"Jia-Hung Chen\",\"Hung-yi Lee\"]","published":"2025-07-05T16:42:09Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.AI\",\"cs.CL\"]","methods":"[]","has_code":false}
