{"ID":2893883,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.12064","arxiv_id":"2507.12064","title":"StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features","abstract":"This submission to the binary AI detection task is based on a modular stylometric pipeline, where: public spaCy models are used for text preprocessing (including tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and extracting several thousand features (frequencies of n-grams of the above linguistic annotations); light-gradient boosting machines are used as the classifier. We collect a large corpus of more than 500 000 machine-generated texts for the classifier's training. We explore several parameter options to increase the classifier's capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive but explainable approach found effective previously.","short_abstract":"This submission to the binary AI detection task is based on a modular stylometric pipeline, where: public spaCy models are used for text preprocessing (including tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and extracting several thousand features (frequ...","url_abs":"https://arxiv.org/abs/2507.12064","url_pdf":"https://arxiv.org/pdf/2507.12064v1","authors":"[\"Jeremi K. Ochab\",\"Mateusz Matias\",\"Tymoteusz Boba\",\"Tomasz Walkowiak\"]","published":"2025-07-16T09:21:20Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[]","has_code":false}
