{"ID":2845956,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.03845","arxiv_id":"2511.03845","title":"To See or To Read: User Behavior Reasoning in Multimodal LLMs","abstract":"Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user-behavior data are more effective for maximizing MLLM performance remains underexplored. We present \\texttt{BehaviorLens}, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs by representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when data is represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5% compared with an equivalent textual representation, without any additional computational cost.","short_abstract":"Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user-behavior data are more effective for maximizing MLLM performance remains underexplored. We present \\texttt{BehaviorLens}, a systematic be...","url_abs":"https://arxiv.org/abs/2511.03845","url_pdf":"https://arxiv.org/pdf/2511.03845v1","authors":"[\"Tianning Dong\",\"Luyi Ma\",\"Varun Vasudevan\",\"Jason Cho\",\"Sushant Kumar\",\"Kannan Achan\"]","published":"2025-11-05T20:26:40Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
