{"ID":2853099,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.16756","arxiv_id":"2510.16756","title":"End-to-end Listen, Look, Speak and Act","abstract":"Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released at https://github.com/bytedance/SALMONN/tree/ELLSA.","short_abstract":"Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledg...","url_abs":"https://arxiv.org/abs/2510.16756","url_pdf":"https://arxiv.org/pdf/2510.16756v2","authors":"[\"Siyin Wang\",\"Wenyi Yu\",\"Xianzhao Chen\",\"Xiaohai Tian\",\"Jun Zhang\",\"Lu Lu\",\"Chao Zhang\"]","published":"2025-10-19T08:45:46Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.CV\",\"cs.RO\",\"eess.AS\"]","methods":"[]","has_code":false,"code_links":[{"ID":608054,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2853099,"paper_url":"https://arxiv.org/abs/2510.16756","paper_title":"End-to-end Listen, Look, Speak and Act","repo_url":"https://github.com/bytedance/SALMONN","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}