{"ID":2885755,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.04379","arxiv_id":"2508.04379","title":"VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones","abstract":"Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences, which stabilizes pre-training and mitigates the modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; and (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first on the GIFT-Eval benchmark, which comprises 23 datasets across 7 domains. 
Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.","short_abstract":"Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the dat...","url_abs":"https://arxiv.org/abs/2508.04379","url_pdf":"https://arxiv.org/pdf/2508.04379v3","authors":"[\"Lefei Shen\",\"Mouxiang Chen\",\"Xu Liu\",\"Han Fu\",\"Xiaoxue Ren\",\"Jianling Sun\",\"Zhuo Li\",\"Chenghao Liu\"]","published":"2025-08-06T12:17:09Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\"]","methods":"[]","has_code":false,"code_links":[{"ID":611233,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2885755,"paper_url":"https://arxiv.org/abs/2508.04379","paper_title":"VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones","repo_url":"https://github.com/HALF111/VisionTSpp","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}