{"ID":2868331,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16531","arxiv_id":"2509.16531","title":"Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains","abstract":"Authorship representation (AR) learning, which models an author's unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings-mostly in English-leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model's improved performance.","short_abstract":"Authorship representation (AR) learning, which models an author's unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings-mostly in English-leaving the potential benefits of multilingual AR models underexplored. We...","url_abs":"https://arxiv.org/abs/2509.16531","url_pdf":"https://arxiv.org/pdf/2509.16531v1","authors":"[\"Junghwan Kim\",\"Haotian Zhang\",\"David Jurgens\"]","published":"2025-09-20T04:43:24Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
