{"ID":2865146,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22030","arxiv_id":"2509.22030","title":"From Outliers to Topics in Language Models: Anticipating Trends in News Corpora","abstract":"This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track their evolution over time in French and English news datasets focused on corporate social responsibility and climate change. The results reveal a consistent pattern: outliers tend to evolve into coherent topics over time across both models and languages.","short_abstract":"This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track their evolution over time in French and English news datasets foc...","url_abs":"https://arxiv.org/abs/2509.22030","url_pdf":"https://arxiv.org/pdf/2509.22030v1","authors":"[\"Evangelia Zve\",\"Benjamin Icard\",\"Alice Breton\",\"Lila Sainero\",\"Gauvain Bourgne\",\"Jean-Gabriel Ganascia\"]","published":"2025-09-26T08:07:02Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
