{"ID":2883932,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09220","arxiv_id":"2508.09220","title":"Towards Scalable Training for Handwritten Mathematical Expression Recognition","abstract":"Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of \\textbf{H}andwritten \\textbf{M}athematical \\textbf{E}xpression \\textbf{R}ecognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed \\texttt{Tex80M}, comprising over 80 million high-quality training instances. Then we propose \\texttt{TexTeller}, the first HMER model trained at scale, by mix-training \\texttt{Tex80M} with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped \\texttt{TexTeller} with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.","short_abstract":"Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of \\textbf{H}andwritten \\textbf{M}athematical \\textbf{E}xpression \\textbf{R}ecognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of...","url_abs":"https://arxiv.org/abs/2508.09220","url_pdf":"https://arxiv.org/pdf/2508.09220v4","authors":"[\"Haoyang Li\",\"Jiaqing Li\",\"Jialun Cao\",\"Zongyuan Yang\",\"Yongping Xiong\"]","published":"2025-08-11T19:10:34Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[]","has_code":false}
