{"ID":2840105,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.14598","arxiv_id":"2511.14598","title":"Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages","abstract":"High quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.","short_abstract":"High quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Fron...","url_abs":"https://arxiv.org/abs/2511.14598","url_pdf":"https://arxiv.org/pdf/2511.14598v1","authors":"[\"Noam Dahan\",\"Omer Kidron\",\"Gabriel Stanovsky\"]","published":"2025-11-18T15:39:48Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
