{"ID":2862948,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26542","arxiv_id":"2509.26542","title":"Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap","abstract":"We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing \"thinking time\" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.","short_abstract":"We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Cont...","url_abs":"https://arxiv.org/abs/2509.26542","url_pdf":"https://arxiv.org/pdf/2509.26542v1","authors":"[\"Yueqian Lin\",\"Zhengmian Hu\",\"Qinsi Wang\",\"Yudong Liu\",\"Hengfan Zhang\",\"Jayakumar Subramanian\",\"Nikos Vlassis\",\"Hai Helen Li\",\"Yiran Chen\"]","published":"2025-09-30T17:17:09Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.MM\",\"cs.SD\"]","methods":"[\"Generative Adversarial Network\"]","has_code":false}
