{"ID":2832754,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.04203","arxiv_id":"2601.04203","title":"FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback","abstract":"We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). 
Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk","short_abstract":"We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet t...","url_abs":"https://arxiv.org/abs/2601.04203","url_pdf":"https://arxiv.org/pdf/2601.04203v1","authors":"[\"Xueqing Wu\",\"Zihan Xue\",\"Da Yin\",\"Shuyan Zhou\",\"Kai-Wei Chang\",\"Nanyun Peng\",\"Yeming Wen\"]","published":"2025-12-05T23:28:09Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.CV\",\"cs.LG\",\"cs.SE\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606271,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832754,"paper_url":"https://arxiv.org/abs/2601.04203","paper_title":"FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback","repo_url":"https://github.com/shirley-wu/frontalk","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}