{"ID":2835117,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23714","arxiv_id":"2512.23714","title":"PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents","abstract":"We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks-sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP)-and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at https://github.com/KevinYuLei/PharmaShip.","short_abstract":"We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks-sequence entity recognition (SER), relation extraction (RE), and reading order...","url_abs":"https://arxiv.org/abs/2512.23714","url_pdf":"https://arxiv.org/pdf/2512.23714v1","authors":"[\"Tingwei Xie\",\"Tianyi Zhou\",\"Yonghong Song\"]","published":"2025-11-29T06:55:45Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false,"code_links":[{"ID":606488,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2835117,"paper_url":"https://arxiv.org/abs/2512.23714","paper_title":"PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents","repo_url":"https://github.com/KevinYuLei/PharmaShip","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
