{"ID":3050175,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T08:26:15.225160212Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04592","arxiv_id":"2606.04592","title":"Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?","abstract":"LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \\times 5 \\times 2 \\times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.","short_abstract":"LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most rele...","url_abs":"https://arxiv.org/abs/2606.04592","url_pdf":"https://arxiv.org/pdf/2606.04592v1","authors":"[\"Leonard Kinzinger\",\"Jochen Hartmann\"]","published":"2026-06-03T08:30:03Z","proceeding":"cs.CY","tasks":"[\"cs.CY\",\"cs.AI\",\"cs.HC\"]","methods":"[\"Large Language Model\"]","has_code":false}
