{"ID":2830746,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.09701","arxiv_id":"2512.09701","title":"FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text","abstract":"We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq","short_abstract":"We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides p...","url_abs":"https://arxiv.org/abs/2512.09701","url_pdf":"https://arxiv.org/pdf/2512.09701v2","authors":"[\"Binbin Xu\"]","published":"2025-12-10T14:49:59Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false,"code_links":[{"ID":606072,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2830746,"paper_url":"https://arxiv.org/abs/2512.09701","paper_title":"FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text","repo_url":"https://github.com/Bin-2/FineFreq","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
