{"ID":2887865,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.00956","arxiv_id":"2508.00956","title":"Learning Unified User Quantized Tokenizers for User Representation","abstract":"Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, we use the Qwen3 Embedding model to derive a compact yet expressive feature representation; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results showcase U2QT's advantages across diverse downstream tasks, outperforming task-specific baselines in future behavior prediction and recommendation tasks while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.","short_abstract":"Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability a...","url_abs":"https://arxiv.org/abs/2508.00956","url_pdf":"https://arxiv.org/pdf/2508.00956v2","authors":"[\"Chuan He\",\"Yang Chen\",\"Wuliang Huang\",\"Tianyi Zheng\",\"Jianhu Chen\",\"Bin Dou\",\"Yice Luo\",\"Yun Zhu\",\"Baokun Wang\",\"Yongchao Liu\",\"Xing Fu\",\"Yu Cheng\",\"Chuntao Hong\",\"Weiqiang Wang\",\"Xin-Wei Yao\",\"Zhongle Xie\"]","published":"2025-08-01T08:35:32Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.IR\"]","methods":"[\"Language Model\",\"Variational Autoencoder\"]","has_code":false}