{"ID":2860271,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04136","arxiv_id":"2510.04136","title":"MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition","abstract":"Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.","short_abstract":"Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a...","url_abs":"https://arxiv.org/abs/2510.04136","url_pdf":"https://arxiv.org/pdf/2510.04136v1","authors":"[\"Umberto Cappellazzo\",\"Minsu Kim\",\"Pingchuan Ma\",\"Honglie Chen\",\"Xubo Liu\",\"Stavros Petridis\",\"Maja Pantic\"]","published":"2025-10-05T10:34:34Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CV\",\"cs.SD\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
