{"ID":2866404,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20410","arxiv_id":"2509.20410","title":"Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction","abstract":"Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.","short_abstract":"Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpo...","url_abs":"https://arxiv.org/abs/2509.20410","url_pdf":"https://arxiv.org/pdf/2509.20410v4","authors":"[\"Weijie Wu\",\"Wenhao Guan\",\"Kaidi Wang\",\"Peijie Chen\",\"Zhuanling Zha\",\"Junbo Li\",\"Jun Fang\",\"Lin Li\",\"Qingyang Hong\"]","published":"2025-09-24T07:09:19Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\"]","methods":"[\"Large Language Model\"]","has_code":false}
