{"ID":2896041,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.07582","arxiv_id":"2507.07582","title":"Improving Clustering on Occupational Text Data through Dimensionality Reduction","abstract":"In this study, we focused on proposing an optimal clustering mechanism for the occupations defined in the well-known US-based occupational database, O*NET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary for different firms and countries. Hence, if one wants to expand the data that is already collected in O*NET for the occupations defined with different tasks, a map between the definitions will be a vital requirement. We proposed a pipeline using several BERT-based techniques with various clustering approaches to obtain such a map. We also examined the effect of dimensionality reduction approaches on several metrics used in measuring performance of clustering algorithms. Finally, we improved our results by using a specialized silhouette approach. This new clustering-based mapping approach with dimensionality reduction may help distinguish the occupations automatically, creating new paths for people wanting to change their careers.","short_abstract":"In this study, we focused on proposing an optimal clustering mechanism for the occupations defined in the well-known US-based occupational database, O*NET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary for different firms and countries. Hence, if one w...","url_abs":"https://arxiv.org/abs/2507.07582","url_pdf":"https://arxiv.org/pdf/2507.07582v1","authors":"[\"Iago Xabier Vázquez García\",\"Damla Partanaz\",\"Emrullah Fatih Yetkin\"]","published":"2025-07-10T09:36:54Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\",\"cs.CY\"]","methods":"[]","has_code":false}
