{"ID":2898090,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.03971","arxiv_id":"2507.03971","title":"Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data","abstract":"Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.","short_abstract":"Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of larg...","url_abs":"https://arxiv.org/abs/2507.03971","url_pdf":"https://arxiv.org/pdf/2507.03971v1","authors":"[\"Anurag Garg\",\"Muhammad Ali\",\"Noah Hollmann\",\"Lennart Purucker\",\"Samuel Müller\",\"Frank Hutter\"]","published":"2025-07-05T09:39:07Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"stat.ME\",\"stat.ML\"]","methods":"[]","has_code":false}