{"ID":2862135,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00974","arxiv_id":"2510.00974","title":"JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation","abstract":"Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is now available: https://github.com/justin-herry/JEPA-T.git","short_abstract":"Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textua...","url_abs":"https://arxiv.org/abs/2510.00974","url_pdf":"https://arxiv.org/pdf/2510.00974v1","authors":"[\"Siheng Wan\",\"Zhengtao Yao\",\"Zhengdao Li\",\"Junhao Dong\",\"Yanshu Li\",\"Yikai Li\",\"Linshan Li\",\"Haoyan Xu\",\"Yijiang Li\",\"Zhikang Dong\",\"Huacan Wang\",\"Jifeng Shen\"]","published":"2025-10-01T14:51:10Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":608875,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2862135,"paper_url":"https://arxiv.org/abs/2510.00974","paper_title":"JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation","repo_url":"https://github.com/justin-herry/JEPA-T.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
