{"ID":2891572,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18649","arxiv_id":"2507.18649","title":"Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching","abstract":"We present Livatar, a real-time audio-driven talking-head video generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow-matching-based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality with an 8.50 LipSync Confidence on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU. This makes high-fidelity avatars accessible to broader applications. Our project is available at https://www.hedra.com/ with examples at https://h-liu1997.github.io/Livatar-1/","short_abstract":"We present Livatar, a real-time audio-driven talking-head video generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow-matching-based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality w...","url_abs":"https://arxiv.org/abs/2507.18649","url_pdf":"https://arxiv.org/pdf/2507.18649v1","authors":"[\"Haiyang Liu\",\"Xiaolin Hong\",\"Xuancheng Yang\",\"Yudi Ruan\",\"Xiang Lian\",\"Michael Lingelbach\",\"Hongwei Yi\",\"Wei Li\"]","published":"2025-07-22T01:02:29Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","project_urls":"[\"https://www.hedra.com/\"]","has_code":false}
