{"ID":2921831,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:56:00.181519634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01393","arxiv_id":"2606.01393","title":"Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing","abstract":"Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formula, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual contents. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.","short_abstract":"Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. \nHowever, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sa...","url_abs":"https://arxiv.org/abs/2606.01393","url_pdf":"https://arxiv.org/pdf/2606.01393v1","authors":"[\"Minglai Yang\",\"Xinyan Velocity Yu\",\"Pengyuan Li\",\"Xinyu Guo\",\"Zhenting Qi\",\"Konwoo Kim\",\"Longtian Ye\",\"Xiaolong Luo\",\"Jinhe Bi\",\"Henry Zhang\",\"Haris Riaz\",\"Xuan Zhang\",\"Yunze Xiao\",\"Bangya Liu\",\"Tom Tang\",\"Yunfei Zhao\",\"Qunshu Lin\",\"Zihan Wang\",\"Minghao Liu\",\"Michael Lingzhi Li\",\"Yilun Du\",\"Jesse Thomason\",\"Rogerio Feris\",\"Alex Pentland\",\"Zexue He\"]","published":"2026-05-31T18:35:30Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}
