{"ID":2921644,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:56:00.181519634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01109","arxiv_id":"2606.01109","title":"Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities","abstract":"Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600 footnote-embedded references, together with PDF-TEI Editor (a collaborative web annotation tool), a documented seven-annotator workflow, and a Grobid specialization for footnote-based citations. In end-to-end evaluation, the specialized pipeline nearly doubles extraction quality over default Grobid (micro-F1 from 0.36 to 0.72), driven largely by improved recall, while showing that substantial headroom remains for cross-references and mixed-content footnotes. This extended abstract presents work in progress; annotation of citation segmentation and parsing, and cross-reference resolution, are ongoing.","short_abstract":"Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. 
To add...","url_abs":"https://arxiv.org/abs/2606.01109","url_pdf":"https://arxiv.org/pdf/2606.01109v1","authors":"[\"Luca Foppiano\",\"Christian Boulanger\"]","published":"2026-05-31T08:59:49Z","proceeding":"cs.DL","tasks":"[\"cs.DL\",\"cs.CL\"]","methods":"[]","has_code":false}
