{"ID":2838959,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.16134","arxiv_id":"2511.16134","title":"Benchmarking Table Extraction from Heterogeneous Scientific Documents","abstract":"Table Extraction (TE) consists in extracting tables from PDF documents, in a structured format which can be automatically processed. While numerous TE tools exist, the variety of methods and techniques makes it difficult for users to choose an appropriate one. We propose a novel benchmark for assessing end-to-end TE methods (from PDF to the final table). We contribute an analysis of TE evaluation metrics, and the design of a rigorous evaluation process, which allows scoring each TE sub-task as well as end-to-end TE, and captures model uncertainty. Along with a prior dataset, our benchmark comprises two new heterogeneous datasets of 37k samples. We run our benchmark on diverse models, including off-the-shelf libraries, software tools, large vision language models, and approaches based on computer vision. The results demonstrate that TE remains challenging: current methods suffer from a lack of generalizability when facing heterogeneous data, and from limitations in robustness and interpretability.","short_abstract":"Table Extraction (TE) consists in extracting tables from PDF documents, in a structured format which can be automatically processed. While numerous TE tools exist, the variety of methods and techniques makes it difficult for users to choose an appropriate one. We propose a novel benchmark for assessing end-to-end TE me...","url_abs":"https://arxiv.org/abs/2511.16134","url_pdf":"https://arxiv.org/pdf/2511.16134v1","authors":"[\"Marijan Soric\",\"Cécile Gracianne\",\"Ioana Manolescu\",\"Pierre Senellart\"]","published":"2025-11-20T08:09:48Z","proceeding":"cs.DB","tasks":"[\"cs.DB\"]","methods":"[\"Language Model\"]","has_code":false}
