{"ID":2887053,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.05672","arxiv_id":"2508.05672","title":"LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing","abstract":"Retrieval Augmented Generation (RAG) systems often struggle with domain-specific knowledge due to performance deterioration of pre-trained embeddings and prohibitive computational costs of large language model (LLM)-based retrievers. While fine-tuning data augmentation embedding models offers a promising direction, its effectiveness is limited by the need for high-quality training data and reliable chunking strategies that preserve contextual integrity. We propose LMAR (Language Model Augmented Retriever), a model-agnostic framework that addresses these challenges by combining LLM-guided data synthesis with contrastive embedding adaptation and efficient text clustering. LMAR consists of a two-stage pipeline: (1) Triplet sampling and synthetic data augmentation, where LLMs act as both labeler and validator to ensure high-fidelity supervision throughout the pipeline. Experimental results across multiple domain-specific benchmark datasets demonstrate that LMAR outperforms multiple baseline models, while maintaining moderate hardware requirements and low latency. Its model-agnostic nature further enables seamless integration with emerging RAG architectures and text embedding models, ensuring continual improvements without redesigning the pipeline. These results highlight LMAR as a practical and cost-effective solution for scalable domain-specific adaptation.","short_abstract":"Retrieval Augmented Generation (RAG) systems often struggle with domain-specific knowledge due to performance deterioration of pre-trained embeddings and prohibitive computational costs of large language model (LLM)-based retrievers. While fine-tuning data augmentation embedding models offers a promising direction, its...","url_abs":"https://arxiv.org/abs/2508.05672","url_pdf":"https://arxiv.org/pdf/2508.05672v2","authors":"[\"Yao Zhao\",\"Yantian Ding\",\"Zhiyue Zhang\",\"Dapeng Yao\",\"Yanxun Xu\"]","published":"2025-08-04T16:59:43Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}