{"ID":2884101,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.07264","arxiv_id":"2508.07264","title":"FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning","abstract":"Multimodal classification requires robust integration of visual and textual signals, yet common fusion strategies are brittle and vulnerable to modality-specific noise. In this paper, we present \\textsc{FLUID}-Flow-Latent Unified Integration via Token Distillation for Expert Specialization, a principled token-level pipeline that improves cross-modal robustness and scalability. \\textsc{FLUID} contributes three core elements: (1) \\emph{Q-transforms}, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment and then performs adaptive, task-aware fusion through a gating mechanism and a \\emph{Q-bottleneck} that selectively compresses information for downstream reasoning; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time that enables efficient specialization to diverse semantic patterns. Extensive experiments demonstrate that \\textsc{FLUID} attains \\(91\\%\\) accuracy on the GLAMI-1M benchmark, significantly outperforming prior baselines and exhibiting strong resilience to label noise, long-tail class imbalance, and semantic heterogeneity. Targeted ablation studies corroborate both the individual and synergistic benefits of the proposed components, positioning \\textsc{FLUID} as a scalable, noise-resilient solution for multimodal product classification.","short_abstract":"Multimodal classification requires robust integration of visual and textual signals, yet common fusion strategies are brittle and vulnerable to modality-specific noise. In this paper, we present \\textsc{FLUID}-Flow-Latent Unified Integration via Token Distillation for Expert Specialization, a principled token-level pip...","url_abs":"https://arxiv.org/abs/2508.07264","url_pdf":"https://arxiv.org/pdf/2508.07264v1","authors":"[\"Van Duc Cuong\",\"Ta Dinh Tam\",\"Tran Duc Chinh\",\"Nguyen Thi Hanh\"]","published":"2025-08-10T09:34:17Z","proceeding":"cs.SI","tasks":"[\"cs.SI\",\"cs.AI\"]","methods":"[]","has_code":false}
