{"ID":2869949,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14136","arxiv_id":"2509.14136","title":"SV-Mixer: Replacing the Transformer Encoder with Lightweight MLPs for Self-Supervised Model Compression in Speaker Verification","abstract":"Self-supervised learning (SSL) has pushed speaker verification accuracy close to state-of-the-art levels, but the Transformer backbones used in most SSL encoders hinder on-device and real-time deployment. Prior compression work trims layer depth or width yet still inherits the quadratic cost of self-attention. We propose SV-Mixer, the first fully MLP-based student encoder for SSL distillation. SV-Mixer replaces Transformer with three lightweight modules: Multi-Scale Mixing for multi-resolution temporal features, Local-Global Mixing for frame-to-utterance context, and Group Channel Mixing for spectral subspaces. Distilled from WavLM, SV-Mixer outperforms a Transformer student by 14.6% while cutting parameters and GMACs by over half, and at 75% compression, it closely matches the teacher's performance. Our results show that attention-free SSL students can deliver teacher-level accuracy with hardware-friendly footprints, opening the door to robust on-device speaker verification.","short_abstract":"Self-supervised learning (SSL) has pushed speaker verification accuracy close to state-of-the-art levels, but the Transformer backbones used in most SSL encoders hinder on-device and real-time deployment. Prior compression work trims layer depth or width yet still inherits the quadratic cost of self-attention. \nWe propo...","url_abs":"https://arxiv.org/abs/2509.14136","url_pdf":"https://arxiv.org/pdf/2509.14136v1","authors":"[\"Jungwoo Heo\",\"Hyun-seo Shin\",\"Chan-yeong Lim\",\"Kyo-won Koo\",\"Seung-bin Kim\",\"Jisoo Son\",\"Ha-Jin Yu\"]","published":"2025-09-17T16:16:30Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Transformer\"]","has_code":false}
