{"ID":2921832,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:56:00.181519634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01394","arxiv_id":"2606.01394","title":"UniD$^3$: A Knowledge Graph-Enhanced RAG Framework for Drug-Disease Discovery and Reasoning","abstract":"Systematic characterization of drug-disease relationships is essential for drug discovery and repurposing, yet is hindered by the heterogeneity and rapid growth of biomedical literature. Existing datasets rely on labor-intensive curation and are often incomplete, while LLM-only approaches suffer from hallucination and weak evidence grounding. We introduce UniD$^3$, a unified framework that integrates Large Language Models with Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) to extract, organize, and validate drug-disease knowledge across Drug-Disease Matching (DDM), Drug Effectiveness Assessment (DEA), and Drug-Target Analysis (DTA). UniD$^3$ processes 157,849 PubMed articles with Llama 3.3-70B and constructs knowledge graphs via a dual-stage strategy combining paper-level extraction with KG-level consolidation centered on drug and disease entities. These graphs support KG-RAG-based generation of structured datasets, evaluated through external benchmarks, fuzzy matching with curated resources, and clinician review. UniD$^3$ produces six knowledge graphs and large-scale datasets, including 28,915 DDM, 15,042 DEA, and over 4,000 DTA QA pairs. External validation shows strong performance (F1: 0.85-0.87 for DDM/DEA; 0.82 for DTA), with clinician review confirming high reliability (AUROC = 0.90). KG-RAG-augmented models outperform standalone LLMs, and the UniD$^3$ chatbot enables interpretable, citation-supported exploration of drug-disease relationships. UniD$^3$ provides a scalable, extensible framework for transforming unstructured biomedical literature into high-quality, structured drug-disease knowledge, supporting AI-driven discovery, repurposing, and precision medicine.","short_abstract":"Systematic characterization of drug-disease relationships is essential for drug discovery and repurposing, yet is hindered by the heterogeneity and rapid growth of biomedical literature. Existing datasets rely on labor-intensive curation and are often incomplete, while LLM-only approaches suffer from hallucination and...","url_abs":"https://arxiv.org/abs/2606.01394","url_pdf":"https://arxiv.org/pdf/2606.01394v1","authors":"[\"Qing Wang\",\"Tianshi Liu\",\"Minghao Zhou\",\"Jialu Liang\",\"Sen Guo\",\"Guangyu Wang\",\"Jing Su\",\"Qianqian Song\"]","published":"2026-05-31T18:36:41Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\",\"LoRA\",\"Generative Adversarial Network\"]","has_code":false}
