{"ID":2856065,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10951","arxiv_id":"2510.10951","title":"Punctuation-aware treebank tree binarization","abstract":"This article presents a curated resource and evaluation suite for punctuation-aware treebank binarization. Standard binarization pipelines drop punctuation before head selection, which alters constituent shape and harms head-child identification. We release (1) a reproducible pipeline that preserves punctuation as sibling nodes prior to binarization, (2) derived artifacts and metadata (intermediate @X markers, reversibility signatures, alignment indices), and (3) an accompanying evaluation suite covering head-child prediction, round-trip reversibility, and structural compatibility with derivational resources (CCGbank). On the Penn Treebank, punctuation-aware preprocessing improves head prediction accuracy from 73.66\\% (Collins rules) and 86.66\\% (MLP) to 91.85\\% with the same classifier, and achieves competitive alignment against CCGbank derivations. All code, configuration files, and documentation are released to enable replication and extension to other corpora.","short_abstract":"This article presents a curated resource and evaluation suite for punctuation-aware treebank binarization. Standard binarization pipelines drop punctuation before head selection, which alters constituent shape and harms head-child identification. We release (1) a reproducible pipeline that preserves punctuation as sibl...","url_abs":"https://arxiv.org/abs/2510.10951","url_pdf":"https://arxiv.org/pdf/2510.10951v1","authors":"[\"Eitan Klinger\",\"Vivaan Wadhwa\",\"Jungyeul Park\"]","published":"2025-10-13T03:02:38Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
