{"ID":2855861,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.12709","arxiv_id":"2510.12709","title":"SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model","abstract":"Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.1% AUC gain.","short_abstract":"Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and busi...","url_abs":"https://arxiv.org/abs/2510.12709","url_pdf":"https://arxiv.org/pdf/2510.12709v3","authors":"[\"Lin Lin\",\"Jiefeng Long\",\"Zhihe Wan\",\"Yuchi Wang\",\"Dingkang Yang\",\"Shuang Yang\",\"Yueyang Yao\",\"Xu Chen\",\"Zirui Guo\",\"Shengqiang Li\",\"Weiran Li\",\"Hanyu Li\",\"Yaling Mou\",\"Yan Qiu\",\"Haiyang Yu\",\"Xiao Liang\",\"Hongsheng Li\",\"Chao Feng\"]","published":"2025-10-14T16:43:22Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}
