{"ID":2862345,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.01470","arxiv_id":"2510.01470","title":"Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data","abstract":"Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and more features. We describe the construction of a dataset of occupation, state, and industry level features aggregated by monthly active jobs from 2015 - 2025. We illustrate the potential for research and future uses in education and workforce development.","short_abstract":"Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. \nWe adopt O*NET as a framework for building natural language proces...","url_abs":"https://arxiv.org/abs/2510.01470","url_pdf":"https://arxiv.org/pdf/2510.01470v1","authors":"[\"Stephen Meisenbacher\",\"Svetlozar Nestorov\",\"Peter Norlander\"]","published":"2025-10-01T21:27:11Z","proceeding":"cs.CY","tasks":"[\"cs.CY\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
