{"ID":2890473,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19196","arxiv_id":"2507.19196","title":"Towards Multimodal Social Conversations with Robots: Using Vision-Language Models","abstract":"Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require referencing the environment or specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.","short_abstract":"Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require ref...","url_abs":"https://arxiv.org/abs/2507.19196","url_pdf":"https://arxiv.org/pdf/2507.19196v2","authors":"[\"Ruben Janssens\",\"Tony Belpaeme\"]","published":"2025-07-25T12:06:53Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CL\",\"cs.HC\"]","methods":"[\"Language Model\"]","has_code":false}