{"ID":2879665,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.15213","arxiv_id":"2508.15213","title":"Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering","abstract":"Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Select2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs with significantly lower cost.","short_abstract":"Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. 
Continued pretraining internalizes domain knowledge but is costly and lacks cro...","url_abs":"https://arxiv.org/abs/2508.15213","url_pdf":"https://arxiv.org/pdf/2508.15213v2","authors":"[\"Bolei He\",\"Xinran He\",\"Run Shao\",\"Shanfu Shu\",\"Xianwei Xue\",\"Mingquan Cheng\",\"Haifeng Li\",\"Zhenhua Ling\"]","published":"2025-08-21T03:53:35Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\"]","has_code":false}
