{"ID":2884859,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.06492","arxiv_id":"2508.06492","title":"Effective Training Data Synthesis for Improving MLLM Chart Understanding","abstract":"Being able to effectively read scientific plots, or chart understanding, is a central part toward building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.","short_abstract":"Being able to effectively read scientific plots, or chart understanding, is a central part toward building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Pre...","url_abs":"https://arxiv.org/abs/2508.06492","url_pdf":"https://arxiv.org/pdf/2508.06492v1","authors":"[\"Yuwei Yang\",\"Zeyu Zhang\",\"Yunzhong Hou\",\"Zhuowan Li\",\"Gaowen Liu\",\"Ali Payani\",\"Yuan-Sen Ting\",\"Liang Zheng\"]","published":"2025-08-08T17:59:10Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611129,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2884859,"paper_url":"https://arxiv.org/abs/2508.06492","paper_title":"Effective Training Data Synthesis for Improving MLLM Chart Understanding","repo_url":"https://github.com/yuweiyang-anu/ECD","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
