{"ID":3006047,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-04T18:58:18.388484401Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02798","arxiv_id":"2606.02798","title":"BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces","abstract":"Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce \\textsc{BehaviorBench}, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. \\textsc{BehaviorBench} reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: \\emph{Belief prediction}, which predicts a user's final revealed stance and confidence in a market, and \\emph{Trade prediction}, which predicts the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Personalization improves Belief prediction more consistently than Trade prediction, model rankings change across task layers and metrics, and different history interfaces expose different failure modes. \\textsc{BehaviorBench} provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.","short_abstract":"Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematical...","url_abs":"https://arxiv.org/abs/2606.02798","url_pdf":"https://arxiv.org/pdf/2606.02798v1","authors":"[\"Liangwei Yang\",\"Jielin Qiu\",\"Zixiang Chen\",\"Ming Zhu\",\"Juntao Tan\",\"Zhiwei Liu\",\"Wenting Zhao\",\"Zhujun Lan\",\"Akshara Prabhakar\",\"Silvio Savarese\",\"Huan Wang\",\"Shelby Heinecke\"]","published":"2026-06-01T19:04:36Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Generative Adversarial Network\"]","has_code":false}