{"ID":2869135,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14594","arxiv_id":"2509.14594","title":"SynBench: A Benchmark for Differentially Private Text Generation","abstract":"Synthetic text generation with Differential Privacy (DP) guarantees emerges as a principled approach that can enable the sharing of sensitive datasets across institutional and regulatory boundaries, while bounding the risks of re-identification and membership inference. LLM-based methods deliver promising results; however, comparisons are exacerbated by differing evaluation setups and \"private\" datasets, potential pre-training contamination is not considered and guarantees are not verified with DP audits. To advance this field, we introduce a unified evaluation framework with standardised utility and fidelity metrics and privacy audits, encompassing nine curated datasets that capture domain-specific complexities such as technical jargon, long-context dependencies, and specialised document structures. In a large-scale empirical study, we benchmark LLM-based state-of-the-art DP text generators of varying sizes (between 1--8B). Our results indicate that DP synthetic text generation remains an unsolved challenge, with quality deteriorating more as the private datasets deviate further from the generators' pre-training corpora. Our novel synthetic text membership inference attack (MIA) explains this observation: Synthetic data quality is overestimated when LLMs have been pre-trained -- without DP -- on portions of the \"private\" data to be generated. Finally, our work provides the first quantitative evidence that this \"public pre-training and private generation\" paradigm invalidates the guaranteed privacy bounds of real-world private datasets.","short_abstract":"Synthetic text generation with Differential Privacy (DP) guarantees emerges as a principled approach that can enable the sharing of sensitive datasets across institutional and regulatory boundaries, while bounding the risks of re-identification and membership inference. LLM-based methods deliver promising results; howe...","url_abs":"https://arxiv.org/abs/2509.14594","url_pdf":"https://arxiv.org/pdf/2509.14594v2","authors":"[\"Yidan Sun\",\"Viktor Schlegel\",\"Srinivasan Nandakumar\",\"Iqra Zahid\",\"Yuping Wu\",\"Yulong Wu\",\"Hao Li\",\"Jie Zhang\",\"Warren Del-Pinto\",\"Goran Nenadic\",\"Siew Kei Lam\",\"Anil Anthony Bharath\"]","published":"2025-09-18T03:57:50Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
