{"ID":2861614,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.02241","arxiv_id":"2510.02241","title":"Study on LLMs for Promptagator-Style Dense Retriever Training","abstract":"Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary and large-scale LLMs which users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales ($\\leq$14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will inform practitioners with reliable alternatives for synthetic data generation and give insights to maximize fine-tuning results for domain-specific applications.","short_abstract":"Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary and large-scale LLMs which users may not have access to or may be p...","url_abs":"https://arxiv.org/abs/2510.02241","url_pdf":"https://arxiv.org/pdf/2510.02241v1","authors":"[\"Daniel Gwon\",\"Nour Jedidi\",\"Jimmy Lin\"]","published":"2025-10-02T17:29:51Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}