{"ID":2843428,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.08283","arxiv_id":"2511.08283","title":"DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation","abstract":"Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a 10x lower inference cost, which is important for deploying accessible and scalable education technologies.","short_abstract":"Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a m...","url_abs":"https://arxiv.org/abs/2511.08283","url_pdf":"https://arxiv.org/pdf/2511.08283v1","authors":"[\"Vishal Kumar\",\"Shubhra Mishra\",\"Rebecca Hao\",\"Rizwaan Malik\",\"David Broman\",\"Dorottya Demszky\"]","published":"2025-11-11T14:11:56Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}