{"ID":2890731,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18055","arxiv_id":"2507.18055","title":"Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs","abstract":"The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective), and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experiment results reveal significant limitations in LLMs' capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.","short_abstract":"The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplore...","url_abs":"https://arxiv.org/abs/2507.18055","url_pdf":"https://arxiv.org/pdf/2507.18055v1","authors":"[\"Tevin Atwal\",\"Chan Nam Tieu\",\"Yefeng Yuan\",\"Zhan Shi\",\"Yuhong Liu\",\"Liang Cheng\"]","published":"2025-07-24T03:12:16Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.CR\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}