{"ID":2828755,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.12938","arxiv_id":"2512.12938","title":"SPAR: Session-based Pipeline for Adaptive Retrieval on Legacy File Systems","abstract":"The ability to extract value from historical data is essential for enterprise decision-making. However, much of this information remains inaccessible within large legacy file systems that lack structured organization and semantic indexing, making retrieval and analysis inefficient and error-prone. We introduce SPAR (Session-based Pipeline for Adaptive Retrieval), a conceptual framework that integrates Large Language Models (LLMs) into a Retrieval-Augmented Generation (RAG) architecture specifically designed for legacy enterprise environments. Unlike conventional RAG pipelines, which require costly construction and maintenance of full-scale vector databases that mirror the entire file system, SPAR employs a lightweight two-stage process: a semantic Metadata Index is first created, after which session-specific vector databases are dynamically generated on demand. This design reduces computational overhead while improving transparency, controllability, and relevance in retrieval. We provide a theoretical complexity analysis comparing SPAR with standard LLM-based RAG pipelines, demonstrating its computational advantages. To validate the framework, we apply SPAR to a synthesized enterprise-scale file system containing a large corpus of biomedical literature, showing improvements in both retrieval effectiveness and downstream model accuracy. Finally, we discuss design trade-offs and outline open challenges for deploying SPAR across diverse enterprise settings.","short_abstract":"The ability to extract value from historical data is essential for enterprise decision-making. However, much of this information remains inaccessible within large legacy file systems that lack structured organization and semantic indexing, making retrieval and analysis inefficient and error-prone. We introduce SPAR (Se...","url_abs":"https://arxiv.org/abs/2512.12938","url_pdf":"https://arxiv.org/pdf/2512.12938v1","authors":"[\"Duy A. Nguyen\",\"Hai H. Do\",\"Minh Doan\",\"Minh N. Do\"]","published":"2025-12-15T02:54:10Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false}
