{"ID":2833931,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02593","arxiv_id":"2512.02593","title":"Spoken Conversational Agents with Large Language Models","abstract":"Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.","short_abstract":"Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness acr...","url_abs":"https://arxiv.org/abs/2512.02593","url_pdf":"https://arxiv.org/pdf/2512.02593v1","authors":"[\"Chao-Han Huck Yang\",\"Andreas Stolcke\",\"Larry Heck\"]","published":"2025-12-02T10:02:10Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.MA\",\"cs.NE\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}