{"ID":2860187,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04016","arxiv_id":"2510.04016","title":"Thai Semantic End-of-Turn Detection for Real-Time Voice Agents","abstract":"Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs to supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.","short_abstract":"Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text...","url_abs":"https://arxiv.org/abs/2510.04016","url_pdf":"https://arxiv.org/pdf/2510.04016v1","authors":"[\"Thanapol Popit\",\"Natthapath Rungseesiripak\",\"Monthol Charattrakool\",\"Saksorn Ruangtanusak\"]","published":"2025-10-05T03:31:59Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Transformer\",\"Large Language Model\"]","has_code":false}
