{"ID":2860981,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08587","arxiv_id":"2510.08587","title":"EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation","abstract":"This paper presents EGSTalker, a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). Designed to enhance both speed and visual fidelity, EGSTalker requires only 3-5 minutes of training video to synthesize high-quality facial animations. The framework comprises two key stages: static Gaussian initialization and audio-driven deformation. In the first stage, a multi-resolution hash triplane and a Kolmogorov-Arnold Network (KAN) are used to extract spatial features and construct a compact 3D Gaussian representation. In the second stage, we propose an Efficient Spatial-Audio Attention (ESAA) module to fuse audio and spatial cues, while KAN predicts the corresponding Gaussian deformations. Extensive experiments demonstrate that EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods, while significantly outperforming them in inference speed. These results highlight EGSTalker's potential for real-time multimedia applications.","short_abstract":"This paper presents EGSTalker, a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). Designed to enhance both speed and visual fidelity, EGSTalker requires only 3-5 minutes of training video to synthesize high-quality facial animations. The framework comprises two key stages:...","url_abs":"https://arxiv.org/abs/2510.08587","url_pdf":"https://arxiv.org/pdf/2510.08587v1","authors":"[\"Tianheng Zhu\",\"Yinfeng Yu\",\"Liejun Wang\",\"Fuchun Sun\",\"Wendong Zheng\"]","published":"2025-10-03T14:31:20Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"eess.AS\"]","methods":"[]","has_code":false}
