{"ID":2871662,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.09960","arxiv_id":"2509.09960","title":"Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes","abstract":"Synthetic tabular data generation is increasingly essential in data management, supporting downstream applications when real-world and high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models (LLMs), typically require sufficient reference data, limiting their effectiveness in domain-specific databases with scarce records. While prompt-based LLMs offer flexibility without parameter tuning, they often fail to capture dataset-specific feature-label dependencies and generate redundant data, leading to degradation in downstream task performance. To overcome these issues, we propose ReFine, a framework that (i) derives symbolic \"if-then\" rules from interpretable models and embeds them into prompts to explicitly guide generation toward domain-specific feature distribution, and (ii) applies a dual-granularity filtering strategy that suppresses over-sampling patterns and selectively refines rare but informative samples to reduce distributional imbalance. Extensive experiments on various regression and classification benchmarks demonstrate that ReFine consistently outperforms state-of-the-art methods, achieving up to 0.44 absolute improvement in R-squared for regression and 10.0 percent relative improvement in F1 score for classification tasks.","short_abstract":"Synthetic tabular data generation is increasingly essential in data management, supporting downstream applications when real-world and high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models...","url_abs":"https://arxiv.org/abs/2509.09960","url_pdf":"https://arxiv.org/pdf/2509.09960v1","authors":"[\"Mingxuan Jiang\",\"Yongxin Wang\",\"Ziyue Dai\",\"Yicun Liu\",\"Hongyi Nie\",\"Sen Liu\",\"Hongfeng Chai\"]","published":"2025-09-12T04:34:46Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false}
