{"ID":2886532,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03828","arxiv_id":"2508.03828","title":"MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources","abstract":"We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text. MegaWika 2 is a major upgrade from the original MegaWika, spanning six times as many articles and twice as many fully scraped citations. Both MegaWika and MegaWika 2 support report generation research; whereas MegaWika also focused on supporting question answering and retrieval applications, MegaWika 2 is designed to support fact checking and analyses across time and language.","short_abstract":"We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text. MegaWika 2 is a major upgrade fr...","url_abs":"https://arxiv.org/abs/2508.03828","url_pdf":"https://arxiv.org/pdf/2508.03828v1","authors":"[\"Samuel Barham\",\"Chandler May\",\"Benjamin Van Durme\"]","published":"2025-08-05T18:18:17Z","proceeding":"cs.DL","tasks":"[\"cs.DL\",\"cs.CL\"]","methods":"[]","has_code":false}