{"ID":2874039,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.05830","arxiv_id":"2509.05830","title":"Finetuning LLMs for Human Behavior Prediction in Social Science Experiments","abstract":"Large language models (LLMs) offer a powerful opportunity to simulate the results of social science experiments. In this work, we demonstrate that finetuning LLMs directly on individual-level responses from past experiments meaningfully improves the accuracy of such simulations across diverse social science domains. We construct SocSci210 via an automatic pipeline, a dataset comprising 2.9 million responses from 400,491 participants in 210 open-source social science experiments. Through finetuning, we achieve multiple levels of generalization. In completely unseen studies, our strongest model, Socrates-Qwen-14B, produces predictions that are 26% more aligned with distributions of human responses to diverse outcome questions under varying conditions relative to its base model (Qwen2.5-14B), outperforming GPT-4o by 13%. By finetuning on a subset of conditions in a study, generalization to new unseen conditions is particularly robust, improving by 71%. Since SocSci210 contains rich demographic information, we reduce demographic parity difference, a measure of bias, by 10.6% through finetuning. Because social sciences routinely generate rich, topic-specific datasets, our findings indicate that finetuning on such data could enable more accurate simulations for experimental hypothesis screening. We release our data, models and finetuning code at stanfordhci.github.io/socrates.","short_abstract":"Large language models (LLMs) offer a powerful opportunity to simulate the results of social science experiments. In this work, we demonstrate that finetuning LLMs directly on individual-level responses from past experiments meaningfully improves the accuracy of such simulations across diverse social science domains. We...","url_abs":"https://arxiv.org/abs/2509.05830","url_pdf":"https://arxiv.org/pdf/2509.05830v2","authors":"[\"Akaash Kolluri\",\"Shengguang Wu\",\"Joon Sung Park\",\"Michael S. Bernstein\"]","published":"2025-09-06T20:52:08Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CY\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
